Pretrained Models and Large Language Models
Pretrained Word Vectors
- Pretrained word vectors (e.g., word2vec, GloVe) are useful
- For NLP models, the first layer can be initialised with pretrained word embeddings instead of random weights (see the sketch after this list)
- Easy performance gain across a range of NLP tasks
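A minimal PyTorch sketch of this initialisation. The vectors here are random stand-ins for real pretrained embeddings (in practice they would be loaded from, e.g., a GloVe file), and the vocabulary/dimension sizes are arbitrary:

```python
import torch
import torch.nn as nn

# Placeholder for real pretrained vectors: 5 words, 4 dimensions.
pretrained = torch.randn(5, 4)

# Initialise the model's first layer from the pretrained matrix instead of
# random weights. freeze=False lets fine-tuning update the embeddings.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)

token_ids = torch.tensor([[0, 3, 2]])  # a toy input sequence
vectors = embedding(token_ids)         # shape: (1, 3, 4)
print(vectors.shape)
```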
Pretrained Models
- Even with pretrained word embeddings, most of the network weights are still randomly initialised
- Can we go one step further? Why not use the whole pretrained network for downstream tasks?
GPT-1
- A transformer-based language model
- Pretraining objective: next word prediction
Training/Model Details
- Subword tokenisation: byte-pair encoding (see the sketch after this list)
- 12 transformer layers (110M parameters)
- Pretraining corpus: BookCorpus
- 7k unique books of varying genres; long stretches of contiguous text
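A toy sketch of how byte-pair encoding learns its merges, assuming the classic algorithm (repeatedly count adjacent symbol pairs and merge the most frequent); the corpus and merge count are made up:

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus (word -> frequency)."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Merge every adjacent occurrence of `pair` into a single symbol."""
    new_words = {}
    for word, freq in words.items():
        symbols, merged, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_words[" ".join(merged)] = freq
    return new_words

# Toy corpus: each word is split into characters plus an end-of-word marker.
words = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}

for step in range(5):
    pairs = get_pair_counts(words)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    words = merge_pair(best, words)
    print(step, best)
```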
Adapt To Downstream Tasks
- Given pretrained GPT, how do we use it for downstream tasks?
- For classification tasks, add an additional classification layer after processing the whole input.
- Classification layer is randomly initialised
- Continue training the whole network using the downstream task dataset
- AKA 'fine-tuning'
Fine-tuning: Sentiment Classification
- Initialise a classification layer, and use the last word's output as its input
- Training instance: <s> I love pizza! <e> → label = positive
- Loss = -log P(positive), e.g., -log(0.4)
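A hedged PyTorch sketch of this fine-tuning setup. `pretrained_lm` is a placeholder standing in for the pretrained GPT body (any module mapping token ids to per-position hidden states), and the token ids are invented:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, num_classes = 768, 2

# Placeholder for the pretrained GPT body.
pretrained_lm = nn.Sequential(nn.Embedding(50000, hidden_size))

# Randomly initialised classification layer added on top.
classifier = nn.Linear(hidden_size, num_classes)

token_ids = torch.tensor([[101, 5, 87, 102]])  # "<s> I love pizza! <e>" (toy ids)
label = torch.tensor([1])                      # 1 = positive

hidden = pretrained_lm(token_ids)              # (1, 4, 768)
last = hidden[:, -1, :]                        # the last word's output
logits = classifier(last)

# Cross-entropy = -log P(correct class); e.g. -log(0.4) if P(positive) = 0.4.
loss = F.cross_entropy(logits, label)
loss.backward()                                # fine-tunes the whole network
```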
Fine-tuning: Question Answering
- Similar recipe: concatenate the context, question, and each candidate answer into one sequence, and score each with the classification layer
BERT
BERT: Bidirectional Encoder Representations from Transformers
- Uses bidirectional self-attention instead of GPT's unidirectional (left-to-right) attention
- That is, each target word attends to both left and right context words to build its contextual representation

Objective: Masked Word Prediction
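A minimal sketch of masked word prediction, assuming BERT's usual recipe of masking roughly 15% of tokens; the encoder itself is replaced by a placeholder producing random logits, and the ids are toy values:

```python
import torch
import torch.nn.functional as F

vocab_size, mask_id = 30522, 103  # WordPiece vocab size and [MASK] id
token_ids = torch.tensor([[2, 1996, 4937, 2938, 2006, 1996, 13523, 3]])

# Randomly choose ~15% of positions to mask.
mask = torch.rand(token_ids.shape) < 0.15
mask[0, 2] = True                 # ensure at least one masked position here
inputs = token_ids.clone()
inputs[mask] = mask_id            # replace chosen words with [MASK]

# A real encoder would produce per-position vocab logits from `inputs`:
# logits = bert(inputs)           # (batch, seq_len, vocab_size)
logits = torch.randn(*token_ids.shape, vocab_size)  # placeholder

# Loss is computed only at masked positions: predict the original word.
loss = F.cross_entropy(logits[mask], token_ids[mask])
```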
Training/Model Details
- Subword tokenisation: WordPiece
- BERT-base: 12 layers of transformers (110M)
- BERT-large: 24 layers of transformers (340M)
- BERT is pretrained on Wikipedia + BookCorpus
- Training takes multiple GPUs over several days
Adapt To Downstream Tasks
- Given pretrained BERT, how do we use it for downstream tasks?
- Same as GPT: add a classification layer
- But in this case, we use the output of a special first token ([CLS]), as sketched below
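A sketch using the Hugging Face transformers library, assuming the bert-base-uncased checkpoint: the tokenizer prepends [CLS], and its output vector feeds a randomly initialised classification layer:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

# Randomly initialised classification layer (fine-tuned with the rest).
classifier = torch.nn.Linear(bert.config.hidden_size, 2)

batch = tokenizer("I love pizza!", return_tensors="pt")  # prepends [CLS]
outputs = bert(**batch)

cls_vector = outputs.last_hidden_state[:, 0]  # representation of [CLS]
logits = classifier(cls_vector)               # (1, 2)
```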
Encoder
- BERT's bidirectional transformer is also called a transformer encoder
- Each output word is conditioned on bidirectional context (bidirectional attention)
- Better contextual representation
- Unable to generate text post-training
- Doesn't work for generation tasks, e.g., machine translation
Decoder
- GPT's unidirectional transformer is also called a transformer decoder
- Each output is conditioned on only left context words (unidirectional attention; see the mask sketch after this list)
- Works on both classification and generation tasks
- But performance on classification is not as good as encoder models
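A small PyTorch sketch contrasting the two attention patterns: a lower-triangular (causal) mask for the decoder versus an all-ones mask for the encoder:

```python
import torch

seq_len = 5

# Decoder (GPT-style): a causal mask lets position i attend only to
# positions <= i, i.e. left context only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Encoder (BERT-style): no masking, every position attends to all others.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         ...
```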
Encoder-Decoder
- How can we marry the benefits of the encoder (strong classification performance but no generation) and the decoder (can generate but weaker at classification)?
- Combine them!
- Use an encoder to process an input
- Use a decoder to generate the output
T5
Text-to-Text Transfer Transformer
- Pretraining objective: span corruption
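A toy sketch of span corruption, assuming spans are given explicitly and sentinels follow T5's <extra_id_n> naming (real T5 samples the spans randomly):

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; the target lists
    the dropped spans, each introduced by its sentinel."""
    sentinels = [f"<extra_id_{i}>" for i in range(len(spans) + 1)]
    inp, tgt, prev = [], [], 0
    for i, (start, end) in enumerate(spans):
        inp += tokens[prev:start] + [sentinels[i]]
        tgt += [sentinels[i]] + tokens[start:end]
        prev = end
    inp += tokens[prev:]
    tgt += [sentinels[len(spans)]]  # closing sentinel
    return " ".join(inp), " ".join(tgt)

tokens = "cows love to eat grass".split()
print(span_corrupt(tokens, [(1, 3)]))
# ('cows <extra_id_0> eat grass', '<extra_id_0> love to <extra_id_1>')
```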
Other Alternative Objectives
Example sentence: cows love to eat grass
- Prefix language model: input "cows love to" → target "eat grass"
- Masked language model: input "cows <M> <M> eat grass" → target "cows love to eat grass"
- Shuffling: input "grass cows to love eat" → target "cows love to eat grass"
- Drop tokens: input "cows love grass" → target "to eat"
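The same four objectives written as plain string transformations on the example sentence above; the mask positions and split points are chosen by hand for illustration:

```python
sentence = "cows love to eat grass".split()

# Prefix language model: condition on a prefix, predict the rest.
prefix_lm = (" ".join(sentence[:3]), " ".join(sentence[3:]))
# ('cows love to', 'eat grass')

# Masked language model: mask some words, reconstruct the full sentence.
masked = ["<M>" if i in (1, 2) else w for i, w in enumerate(sentence)]
masked_lm = (" ".join(masked), " ".join(sentence))
# ('cows <M> <M> eat grass', 'cows love to eat grass')

# Shuffling (deshuffling): recover the original order from shuffled input.
shuffling = ("grass cows to love eat", " ".join(sentence))

# Drop tokens: the input drops some words, the target is the dropped words.
kept = [w for i, w in enumerate(sentence) if i not in (2, 3)]
drop_tokens = (" ".join(kept), "to eat")
# ('cows love grass', 'to eat')
```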
Pretraining Corpus
- Filtered C4: filtered web text (Colossal Clean Crawled Corpus)
- Lots of noise: non-language, non-English, and offensive content
- Heuristics applied to filter them out
- Quality matters (not just quantity)
Fine-tuning
- Cast all tasks as a text-to-text problem, and fine-tune with next word prediction objective
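A sketch of this text-to-text casting with the Hugging Face transformers library, assuming the t5-small checkpoint and one of T5's standard task prefixes:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task becomes "input text -> output text" via a task prefix.
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))

# Fine-tuning uses the same next-word-prediction loss on the target text:
labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss
```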
How good is T5?
- Comes in a few sizes: small (60M), base (220M), large (770M), 3B, and 11B
Large Language Model
A Large Language Model (LLM) is an autoregressive neural network based on the Transformer architecture, pretrained on massive text corpora.