adv-nlp-2
Transformers
A downside of transformers is that the input has a fixed maximum size. This is why all LLMs have a ‘context window’.
Garden path sentences are sentences that lead the reader into a wrong parse, so that partway through you have to go back and reread the sentence. For example: “The old man the boat”. Hint: ‘The old’ is the subject and ‘man’ is the verb.
Multi-task learning: training a model on multiple tasks. E.g. the weights are first trained on masked language modeling (MLM) and then on some classification task.
Transfer learning: Type of multi-task learning where we only really care about one of the tasks.
Fine-tuning: further training a pretrained model for a specific task, typically by adding and training a task-specific head (classification layers).
Continual-pretraining:
BERT
Bidirectional transformer, trained for MLM and next sentence prediction.
For example BERT (large):
- L=24 (24 transformer layers) (J&R refers to these as ‘blocks’)
- H=1024 (size of hidden layers)
- A=16 (number of attention heads)
One ‘set of arrows’ (in a transformer model) is an attention head. Multiple attention heads allow each pair of words to consider different types of relations simultaneously.
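As a rough illustration of how several heads look at the same tokens in different ways, the sketch below splits each hidden vector into equal chunks, one per head, and computes attention weights for each head separately. It is a simplification, not BERT's actual implementation: the function names are made up for these notes, and the hidden vectors are used directly as queries and keys, whereas a real transformer first applies learned projection matrices. With the BERT-large numbers above, each head works on 1024 / 16 = 64 dimensions.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(vector, num_heads):
    """Split one hidden vector of size H into num_heads chunks of size H / num_heads."""
    head_dim, remainder = divmod(len(vector), num_heads)
    if remainder:
        raise ValueError("hidden size must be divisible by the number of heads")
    return [vector[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]


def attention_weights_per_head(hidden, num_heads):
    """Return weights[h][i][j]: how much token i attends to token j in head h.

    Assumption for this sketch: hidden vectors act as both queries and keys
    (no learned projections), so only the per-head split is illustrated.
    """
    per_token = [split_heads(vec, num_heads) for vec in hidden]
    head_dim = len(hidden[0]) // num_heads
    weights = []
    for h in range(num_heads):
        rows = []
        for i in range(len(hidden)):
            scores = [
                sum(q * k for q, k in zip(per_token[i][h], per_token[j][h]))
                / math.sqrt(head_dim)
                for j in range(len(hidden))
            ]
            rows.append(softmax(scores))
        weights.append(rows)
    return weights


# BERT-large: H=1024 and A=16, so each head sees 64 dimensions.
BERT_LARGE_HEAD_DIM = 1024 // 16

# Toy input: 3 tokens, hidden size 4, 2 heads.
toy_hidden = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
]
toy_weights = attention_weights_per_head(toy_hidden, num_heads=2)
```

Because each head only sees its own chunk of the hidden vector, the two heads in the toy example end up with different attention patterns over the same three tokens.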
BERT sequence labeling
One sentence can express multiple events, so the classification is event-specific and the event needs to be encoded.
BERT uses WordPiece subword tokenization and 512 position embeddings, so the maximum input length is 512 tokens.
Segment embeddings mark which of the two input sentences a token belongs to; they are used for next sentence prediction.
Special tokens:
- CLS (classification token, starts an input)
- SEP (separates sentences for next sentence prediction)
- MASK (for MLM)
- UNK (for tokens that cannot be built from the vocabulary, e.g. unknown characters; BERT was trained on English texts only)
- PAD (padding token, fills shorter inputs up to a fixed length)
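A minimal sketch of how these special tokens fit together for a sentence pair, assuming the usual BERT layout `[CLS] A [SEP] B [SEP]` followed by padding. The function name `build_pair_input` and the returned dictionary keys are hypothetical, chosen for these notes; a real pipeline would also map the tokens to vocabulary ids.

```python
def build_pair_input(tokens_a, tokens_b, max_len):
    """Assemble a BERT-style input for two already-tokenized sentences."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("input is longer than max_len")

    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

    # 1 marks real tokens, 0 marks padding that the model should ignore.
    attention_mask = [1] * len(tokens)

    padding = max_len - len(tokens)
    tokens = tokens + ["[PAD]"] * padding
    segment_ids = segment_ids + [0] * padding
    attention_mask = attention_mask + [0] * padding

    return {
        "tokens": tokens,
        "segment_ids": segment_ids,
        "attention_mask": attention_mask,
        "position_ids": list(range(max_len)),  # one position embedding per slot
    }


example = build_pair_input(["the", "old"], ["man", "the", "boat"], max_len=10)
```

For BERT itself `max_len` can be at most 512, since that is the number of position embeddings.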
Reasons for subword tokenization
Derive embeddings from a preferably small, fixed, and robust vocabulary that generalizes to unseen words and neologisms.
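To make the "generalizes to unseen words" point concrete, here is a small sketch of greedy longest-match-first segmentation, the strategy BERT's WordPiece tokenizer uses at inference time, with `##` marking pieces that continue a word. The toy vocabulary and the function name `wordpiece_segment` are made up for illustration; BERT's real vocabulary has about 30000 entries.

```python
def wordpiece_segment(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary.
toy_vocab = {"un", "##believ", "##able", "token", "##ization", "##s"}

unseen_word = wordpiece_segment("unbelievable", toy_vocab)
plural_word = wordpiece_segment("tokenizations", toy_vocab)
no_match = wordpiece_segment("xyz", toy_vocab)
```

Neither "unbelievable" nor "tokenizations" is in the toy vocabulary as a whole word, yet both can be represented by known pieces; only a word with no matching pieces at all falls back to `[UNK]`.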
Types of subword tokenization:
- WordPiece (used by BERT)
  - Instead of using raw pair frequencies (as in BPE), a more complex formula is used to decide which tokens to add to the vocabulary.
- BPE (used by GPT and RoBERTa)
  - For k merges: the most frequent pair of adjacent tokens is merged and added to the vocabulary.
  - BERT's vocabulary has about 30000 subwords; building a vocabulary of that size with BPE would take about that many merges.
- Unigram Language Modeling tokenization
  - Train a unigram language model for a corpus.
A subword tokenizer has two parts:
- A token learner: takes a corpus and induces a vocabulary of tokens
- A token segmenter: takes a raw text and segments it into tokens from that vocabulary
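The two parts can be sketched for BPE as follows. This is the textbook character-level variant with an end-of-word marker `_`, not the byte-level implementation used by GPT; the function names, the marker, the tie-breaking rule, and the toy corpus are all assumptions made for this example.

```python
from collections import Counter

END = "_"  # end-of-word marker; assumes "_" does not occur in the corpus


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by the merged token."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(corpus_words, num_merges):
    """Token learner: induce a vocabulary and an ordered merge list from a corpus."""
    word_freqs = Counter(tuple(word) + (END,) for word in corpus_words)
    vocab = sorted({symbol for symbols in word_freqs for symbol in symbols})
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties are broken by the pair itself
        # so that the result is deterministic (an arbitrary choice).
        best = max(pair_counts, key=lambda pair: (pair_counts[pair], pair))
        merges.append(best)
        vocab.append(best[0] + best[1])
        new_freqs = Counter()
        for symbols, freq in word_freqs.items():
            new_freqs[apply_merge(symbols, best)] += freq
        word_freqs = new_freqs
    return vocab, merges


def segment_bpe(word, merges):
    """Token segmenter: replay the learned merges, in order, on a new word."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: each word repeated by its frequency.
toy_corpus = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
bpe_vocab, bpe_merges = learn_bpe(toy_corpus, num_merges=3)
bug_tokens = segment_bpe("bug", bpe_merges)
pun_tokens = segment_bpe("pun", bpe_merges)
```

The word "bug" never appears in the toy corpus, but the segmenter can still split it using merges learned from "hug" and "pug". The vocabulary size equals the number of base symbols plus the number of merges, which is why a vocabulary of about 30000 entries needs roughly that many merges.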