Pretrained Models and Large Language Models
Pretrained Word Vectors
- Pretrained word vectors (e.g., word2vec, GloVe) are useful
- For NLP models, the first layer can be initialised with pretrained word embeddings instead of random weights (see the sketch after this list)
- Easy performance gain across a range of NLP tasks
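A minimal PyTorch sketch of this initialisation. The vectors here are random stand-ins for real pretrained embeddings (in practice they would be loaded from, e.g., a GloVe file), and the vocabulary/dimension sizes are arbitrary:

```python
import torch
import torch.nn as nn

# Placeholder for real pretrained vectors: 5 words, 4 dimensions.
pretrained = torch.randn(5, 4)

# Initialise the model's first layer from the pretrained matrix instead of
# random weights. freeze=False lets fine-tuning update the embeddings.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)

token_ids = torch.tensor([[0, 3, 2]])  # a toy input sequence
vectors = embedding(token_ids)         # shape: (1, 3, 4)
print(vectors.shape)
```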
Pretrained Models
- Even with pretrained word embeddings, most of the network weights are still randomly initialised
- Can we go one step further? Why not use the whole pretrained network for downstream tasks?
GPT-1
- A transformer-based language model
- Pretraining objective: next word prediction
Training/Model Details
- Subword tokenisation: byte-pair encoding (see the sketch after this list)
- 12 transformer layers (110M parameters)
- Pretraining corpus: BookCorpus
- 7k unique books of varying genres; long stretches of contiguous text
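A toy sketch of how byte-pair encoding learns its merges, assuming the classic algorithm (repeatedly count adjacent symbol pairs and merge the most frequent); the corpus and merge count are made up:

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus (word -> frequency)."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Merge every adjacent occurrence of `pair` into a single symbol."""
    new_words = {}
    for word, freq in words.items():
        symbols, merged, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_words[" ".join(merged)] = freq
    return new_words

# Toy corpus: each word is split into characters plus an end-of-word marker.
words = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6}

for step in range(5):
    pairs = get_pair_counts(words)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    words = merge_pair(best, words)
    print(step, best)
```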
Adapt To Downstream Tasks
- Given pretrained GPT, how do we use it for downstream tasks?
- For classification tasks, add an additional classification layer after processing the whole input.
- Classification layer is randomly initialised
- Continue training the whole network using the downstream task dataset
- AKA 'fine-tuning'
Fine-tuning: Sentiment Classification
- Initialise a classification layer, and use the last word's output as its input
- Training instance: <s> I love pizza! <e> → label = positive
- Loss = -log P(positive), e.g., -log(0.4)
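A hedged PyTorch sketch of this fine-tuning setup. `pretrained_lm` is a placeholder standing in for the pretrained GPT body (any module mapping token ids to per-position hidden states), and the token ids are invented:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, num_classes = 768, 2

# Placeholder for the pretrained GPT body.
pretrained_lm = nn.Sequential(nn.Embedding(50000, hidden_size))

# Randomly initialised classification layer added on top.
classifier = nn.Linear(hidden_size, num_classes)

token_ids = torch.tensor([[101, 5, 87, 102]])  # "<s> I love pizza! <e>" (toy ids)
label = torch.tensor([1])                      # 1 = positive

hidden = pretrained_lm(token_ids)              # (1, 4, 768)
last = hidden[:, -1, :]                        # the last word's output
logits = classifier(last)

# Cross-entropy = -log P(correct class); e.g. -log(0.4) if P(positive) = 0.4.
loss = F.cross_entropy(logits, label)
loss.backward()                                # fine-tunes the whole network
```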
Fine-tuning: Question Answering
- Similar recipe: concatenate the context, question, and each candidate answer into one sequence, and score each with the classification layer
BERT
BERT: Bidirectional Encoder Representations from Transformers
- Uses bidirectional self-attention instead of GPT's unidirectional (left-to-right) attention
- That is, each target word attends to both left and right context words to build its contextual representation

Objective: Masked Word Prediction
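A minimal sketch of masked word prediction, assuming BERT's usual recipe of masking roughly 15% of tokens; the encoder itself is replaced by a placeholder producing random logits, and the ids are toy values:

```python
import torch
import torch.nn.functional as F

vocab_size, mask_id = 30522, 103  # WordPiece vocab size and [MASK] id
token_ids = torch.tensor([[2, 1996, 4937, 2938, 2006, 1996, 13523, 3]])

# Randomly choose ~15% of positions to mask.
mask = torch.rand(token_ids.shape) < 0.15
mask[0, 2] = True                 # ensure at least one masked position here
inputs = token_ids.clone()
inputs[mask] = mask_id            # replace chosen words with [MASK]

# A real encoder would produce per-position vocab logits from `inputs`:
# logits = bert(inputs)           # (batch, seq_len, vocab_size)
logits = torch.randn(*token_ids.shape, vocab_size)  # placeholder

# Loss is computed only at masked positions: predict the original word.
loss = F.cross_entropy(logits[mask], token_ids[mask])
```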
Training/Model Details
- Subword tokenisation: WordPiece
- BERT-base: 12 layers of transformers (110M)
- BERT-large: 24 layers of transformers (340M)
- BERT is pretrained on Wikipedia + BookCorpus
- Training takes multiple GPUs over several days
Adapt To Downstream Tasks
- Given pretrained BERT, how do we use it for downstream tasks?
- Same as GPT: add a classification layer
- But in this case, we use the output of a special first token ([CLS]), as sketched below
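A sketch using the Hugging Face transformers library, assuming the bert-base-uncased checkpoint: the tokenizer prepends [CLS], and its output vector feeds a randomly initialised classification layer:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

# Randomly initialised classification layer (fine-tuned with the rest).
classifier = torch.nn.Linear(bert.config.hidden_size, 2)

batch = tokenizer("I love pizza!", return_tensors="pt")  # prepends [CLS]
outputs = bert(**batch)

cls_vector = outputs.last_hidden_state[:, 0]  # representation of [CLS]
logits = classifier(cls_vector)               # (1, 2)
```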
Encoder
- BERT's bidirectional transformer is also called a transformer encoder
- Each output word is conditioned on bidirectional context (bidirectional attention)
- Better contextual representation
- Unable to generate text post-training
- Doesn't work for generation tasks, e.g., machine translation
Decoder
- GPT's unidirectional transformer is also called a transformer decoder
- Each output is conditioned on only left context words (unidirectional attention; see the mask sketch after this list)
- Works on both classification and generation tasks
- But performance on classification is not as good as encoder models
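A small PyTorch sketch contrasting the two attention patterns: a lower-triangular (causal) mask for the decoder versus an all-ones mask for the encoder:

```python
import torch

seq_len = 5

# Decoder (GPT-style): a causal mask lets position i attend only to
# positions <= i, i.e. left context only.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Encoder (BERT-style): no masking, every position attends to all others.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

print(causal_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         ...
```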
Encoder-Decoder
- How can we marry the benefits of the encoder (strong classification performance but no generation) and the decoder (can generate but weaker at classification)?
- Combine them!
- Use an encoder to process an input
- Use a decoder to generate the output
T5
Text-to-Text Transfer Transformer
- Pretraining objective: span corruption
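A toy sketch of span corruption, assuming spans are given explicitly and sentinels follow T5's <extra_id_n> naming (real T5 samples the spans randomly):

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel; the target lists
    the dropped spans, each introduced by its sentinel."""
    sentinels = [f"<extra_id_{i}>" for i in range(len(spans) + 1)]
    inp, tgt, prev = [], [], 0
    for i, (start, end) in enumerate(spans):
        inp += tokens[prev:start] + [sentinels[i]]
        tgt += [sentinels[i]] + tokens[start:end]
        prev = end
    inp += tokens[prev:]
    tgt += [sentinels[len(spans)]]  # closing sentinel
    return " ".join(inp), " ".join(tgt)

tokens = "cows love to eat grass".split()
print(span_corrupt(tokens, [(1, 3)]))
# ('cows <extra_id_0> eat grass', '<extra_id_0> love to <extra_id_1>')
```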
Other Alternative Objectives
Example sentence: cows love to eat grass
- Prefix language model: input "cows love to" → target "eat grass"
- Masked language model: input "cows <M> <M> eat grass" → target "cows love to eat grass"
- Shuffling: input "grass cows to love eat" → target "cows love to eat grass"
- Drop tokens: input "cows love grass" → target "to eat"
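The same four objectives written as plain string transformations on the example sentence above; the mask positions and split points are chosen by hand for illustration:

```python
sentence = "cows love to eat grass".split()

# Prefix language model: condition on a prefix, predict the rest.
prefix_lm = (" ".join(sentence[:3]), " ".join(sentence[3:]))
# ('cows love to', 'eat grass')

# Masked language model: mask some words, reconstruct the full sentence.
masked = ["<M>" if i in (1, 2) else w for i, w in enumerate(sentence)]
masked_lm = (" ".join(masked), " ".join(sentence))
# ('cows <M> <M> eat grass', 'cows love to eat grass')

# Shuffling (deshuffling): recover the original order from shuffled input.
shuffling = ("grass cows to love eat", " ".join(sentence))

# Drop tokens: the input drops some words, the target is the dropped words.
kept = [w for i, w in enumerate(sentence) if i not in (2, 3)]
drop_tokens = (" ".join(kept), "to eat")
# ('cows love grass', 'to eat')
```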
Pretraining Corpus
- Filtered C4: filtered web text (Colossal Clean Crawled Corpus)
- Lots of noise: non-language, non-English, and offensive content
- Heuristics applied to filter them out
- Quality matters (not just quantity)
Fine-tuning
- Cast all tasks as a text-to-text problem, and fine-tune with next word prediction objective
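A sketch of this text-to-text casting with the Hugging Face transformers library, assuming the t5-small checkpoint and one of T5's standard task prefixes:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task becomes "input text -> output text" via a task prefix.
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))

# Fine-tuning uses the same next-word-prediction loss on the target text:
labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss
```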
How good is T5?
- Comes in a few sizes: small (60M), base (220M), large (770M), 3B, and 11B
Large Language Model
A Large Language Model (LLM) is an autoregressive neural network based on the Transformer architecture, pretrained on massive text corpora.