N-gram Language Models

1. Overview

Language models measure how fluent a sentence is, assigning higher probability to well-formed text. They are essential in applications like speech recognition (e.g., distinguishing "recognise speech" from "wreck a nice beach"), query completion, and text generation (e.g., ChatGPT).

2. Joint Probabilities

The goal is to compute the probability of a sequence of words, P(w_1, w_2, \dots, w_m). Using the chain rule, this joint probability is converted into conditional probabilities:

P(w_1, w_2, \dots, w_m) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_m \mid w_1, w_2, \dots, w_{m-1})
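For example, for the three-word sentence "the dog barks":

P(\text{the}, \text{dog}, \text{barks}) = P(\text{the})\, P(\text{dog} \mid \text{the})\, P(\text{barks} \mid \text{the}, \text{dog})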

3. The Markov Assumption

To make the computation feasible, the Markov assumption approximates the probability of a word based only on the previous n-1 words:

P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})

When n = 1, a unigram model:

P(w_1, w_2, \dots, w_m) = \prod_{i=1}^{m} P(w_i)

e.g., "the dog barks" → P(the) P(dog) P(barks)

When n = 2, a bigram model:

P(w_1, w_2, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-1})

e.g., "the dog barks" → P(the) P(dog | the) P(barks | dog)

When n = 3, a trigram model:

P(w_1, w_2, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_{i-2}, w_{i-1})

e.g., "the dog barks" → P(the) P(dog | the) P(barks | the, dog)
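As a concrete sketch, here is how a bigram model scores a sentence in code; the probability values are made up for illustration, not estimated from any corpus:

```python
# Toy bigram probabilities (illustrative values only, not from a real corpus).
bigram_prob = {
    ("<s>", "the"): 0.5,
    ("the", "dog"): 0.1,
    ("dog", "barks"): 0.3,
    ("barks", "</s>"): 0.4,
}

def sentence_prob(words, probs):
    """P(sentence) under a bigram model: the product of P(w_i | w_{i-1})."""
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):
        p *= probs.get((prev, cur), 0.0)  # unseen bigram -> 0 without smoothing
    return p

print(sentence_prob(["the", "dog", "barks"], bigram_prob))  # 0.5*0.1*0.3*0.4 = 0.006
```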

4. Maximum Likelihood Estimation (MLE)

Probabilities are estimated using counts from a corpus:

  • Unigram: P(w_i) = \frac{C(w_i)}{M} (where M is the total number of word tokens)
  • Bigram: P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}
  • N-gram: P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{C(w_{i-n+1}, \dots, w_i)}{C(w_{i-n+1}, \dots, w_{i-1})}
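A minimal sketch of bigram MLE from counts, using a hypothetical toy corpus (sentences are padded with <s> and </s>, introduced in the next section):

```python
from collections import Counter

# Hypothetical toy corpus; each sentence is padded with <s> and </s> (see Section 5).
corpus = [["the", "dog", "barks"], ["the", "cat", "meows"], ["the", "dog", "runs"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    padded = ["<s>"] + sent + ["</s>"]
    unigrams.update(padded)
    bigrams.update(zip(padded, padded[1:]))

def p_mle(word, prev):
    """MLE bigram estimate: C(prev, word) / C(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_mle("dog", "the"))  # C(the, dog) / C(the) = 2/3
```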

5. Book-ending Sequences

Special tags <s> and </s> denote the start and end of a sentence, respectively, aiding probability calculations at sentence boundaries (e.g., P(yes \mid <s>, <s>) for the first word of a sentence under a trigram model).

6. Smoothing

Smoothing assigns probability to unseen n-grams to avoid zero probabilities. Common techniques include:

  • Laplacian (Add-one) Smoothing
  • Add-k Smoothing
  • Absolute Discounting
  • Kneser-Ney Smoothing

6.1 Laplacian (Add-one) Smoothing

Adds 1 to all counts:

P_{\text{add-1}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + |V|}

(where |V| is the vocabulary size)

6.2 Add-k Smoothing

Adds a fraction k (where k < 1) to all counts:

P_{\text{add-}k}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + k}{C(w_{i-1}) + k|V|}
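A sketch covering both estimators, since k = 1 recovers Laplacian smoothing; the counts and vocabulary below are hypothetical:

```python
from collections import Counter

# Hypothetical counts; in practice these come from a training corpus.
bigrams = Counter({("the", "dog"): 2, ("the", "cat"): 1})
unigrams = Counter({"the": 3, "dog": 2, "cat": 1})
vocab = {"the", "dog", "cat", "barks", "<s>", "</s>"}

def p_add_k(word, prev, k=1.0):
    """Add-k estimate: (C(prev, word) + k) / (C(prev) + k * |V|); k = 1 is add-one."""
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * len(vocab))

print(p_add_k("dog", "the"))         # add-one smoothing
print(p_add_k("barks", "the", 0.1))  # unseen bigram still gets probability mass
```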

6.3 Absolute Discounting

Subtracts a fixed discount from observed counts and redistributes the freed probability mass to unseen n-grams.
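A common interpolated formulation (standard in the literature, though not spelled out in these notes) subtracts a discount d (0 < d ≤ 1) from each observed count and uses a normalizing weight \lambda(w_{i-1}) to spread the freed mass over a lower-order distribution:

P_{\text{abs}}(w_i \mid w_{i-1}) = \frac{\max(C(w_{i-1}, w_i) - d, 0)}{C(w_{i-1})} + \lambda(w_{i-1})\, P(w_i)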

6.4 Kneser-Ney Smoothing

Uses continuation probabilities based on the versatility of lower-order n-grams: a word that appears after many different contexts is more likely to appear in a new context than one that is frequent but follows only a few.
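The key quantity is the continuation probability, which counts how many distinct contexts a word follows rather than how often it occurs (standard formulation, stated here for completeness):

P_{\text{cont}}(w) = \frac{\vert \{w' : C(w', w) > 0\} \vert}{\vert \{(w', w'') : C(w', w'') > 0\} \vert}

For example, "Francisco" may be frequent overall yet almost always follows "San", so its continuation probability is low.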

Text Classification

1. Fundamentals of Classification

Classification involves predicting a class label c from a fixed set of classes C = \{c_1, c_2, \dots, c_k\} for a given input document d, which is often represented as a vector of features. It differs from regression (which predicts continuous outputs) and ranking (which predicts ordinal outputs).

2. Text Classification Tasks

Common text classification tasks include:

  • Topic Classification: Categorizing text by topic (e.g., "acquisitions", "earnings").

  • Sentiment Analysis: Determining sentiment polarity (e.g., positive, negative, neutral).

  • Native-Language Identification: Identifying the author's native language.

  • Natural-Language Inference: Determining relationships between sentences (e.g., entailment, contradiction, neutral).

  • Automatic Fact-Checking and Paraphrase Detection: Other specialized tasks.

Note: input may vary in length, from long documents to short sentences or tweets.

3. Build a Text Classifier

Steps to build a text classifier:

  1. Identify a task of interest
  2. Collect an appropriate corpus
  3. Carry out annotation (label the data)
  4. Select features (e.g., n-grams, word overlap)
  5. Choose a machine learning algorithm
  6. Train the model and tune hyper-parameters using held-out development data.
  7. Repeat earlier steps as needed to improve performance
  8. Train the final model
  9. Evaluate the model on held-out test data
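As an illustration of steps 4 to 6 and 9, a minimal sketch using scikit-learn (an assumed library choice; the toy documents and labels below are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Hypothetical labeled corpus (steps 2-3: corpus collection and annotation).
docs = ["stocks rallied on earnings", "quarterly earnings beat forecasts",
        "the team won the match", "a late goal sealed the game"]
labels = ["finance", "finance", "sports", "sports"]

# Step 4: n-gram features; steps 5-6: choose and train a model.
X_train, X_test, y_train, y_test = train_test_split(docs, labels, test_size=0.5,
                                                    random_state=0)
vec = CountVectorizer(ngram_range=(1, 2))  # unigram + bigram features
clf = MultinomialNB().fit(vec.fit_transform(X_train), y_train)

# Step 9: evaluate on the held-out test data.
print(accuracy_score(y_test, clf.predict(vec.transform(X_test))))
```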

4. Algorithms for Classification

4.1 Naive Bayes

Uses Bayes' rule to find the class with the highest posterior probability.

P(c_n \mid f_1, \dots, f_m) \propto P(c_n) \prod_{i=1}^{m} P(f_i \mid c_n)

  • Assumes features are independent given the class.
  • Pros: Fast to train and classify, robust (low variance), good for low-data situations.
  • Cons: Independence assumption rarely holds, lower accuracy in many cases, requires smoothing for unseen features.
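A from-scratch sketch of the scoring rule, computed in log space to avoid numerical underflow; the smoothed parameters below are illustrative, not learned from data:

```python
import math

# Illustrative smoothed parameters: P(c) and P(feature | c).
prior = {"pos": 0.5, "neg": 0.5}
likelihood = {
    "pos": {"good": 0.3, "bad": 0.05, "movie": 0.2},
    "neg": {"good": 0.05, "bad": 0.3, "movie": 0.2},
}

def classify(features):
    """argmax_c [ log P(c) + sum_i log P(f_i | c) ]."""
    scores = {}
    for c in prior:
        scores[c] = math.log(prior[c]) + sum(math.log(likelihood[c][f])
                                             for f in features)
    return max(scores, key=scores.get)

print(classify(["good", "movie"]))  # "pos"
```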

4.2 Logistic Regression

A linear model that applies the softmax function to a weighted sum of features to produce class probabilities:

P(c_n \mid f_1, \dots, f_m) = \frac{1}{Z} \exp\left(\sum_{i=0}^{m} w_{n,i} f_i\right)

(where Z normalizes over all classes and w_{n,i} are the weights for class c_n)
  • Pros: Handles correlated features well, better performance than Naive Bayes in many cases.

  • Cons: Slow to train, requires feature scaling, needs regularization to prevent overfitting, data hungry.
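A sketch of the softmax computation with class-specific weight vectors; the weights are made up for illustration (in practice they are learned by gradient descent):

```python
import math

def softmax_probs(feats, weights):
    """P(c | f) = exp(w_c . f) / Z, where Z sums the exponentials over all classes."""
    scores = {c: sum(w * f for w, f in zip(w_c, feats)) for c, w_c in weights.items()}
    m = max(scores.values())  # subtract the max score for numerical stability
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

# Illustrative: 3 features, where the first acts as a bias term with f_0 = 1.
weights = {"pos": [0.1, 1.2, -0.4], "neg": [0.2, -0.9, 0.8]}
print(softmax_probs([1.0, 1.0, 0.0], weights))
```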

4.3 Support Vector Machines (SVM)

Finds the hyperplane that separates classes with the maximum margin.

Pros: Fast, accurate, works well with large feature sets, supports non-linearity via the kernel trick.

Cons: Multiclass classification is complex.
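A brief sketch with scikit-learn's SVC (an assumed choice), using the kernel trick on a toy dataset that no straight line can separate:

```python
from sklearn.svm import SVC

# Toy 2-D points in an XOR pattern: not linearly separable.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# An RBF kernel lets the decision boundary be non-linear.
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.predict([[0.9, 0.1]]))  # near the (1, 0) corner, so class 1 is expected
```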

4.4 K-Nearest Neighbours (KNN)

4.5 Decision Tree

4.6 Random Forest

4.7 Neural Networks

Layers of interconnected nodes (input, hidden, output) with activation functions.

Pros: Extremely powerful, minimal feature engineering, dominant in NLP.

Cons: Complex, slow to train, many hyper-parameters, prone to overfitting.
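A minimal sketch of a forward pass through one hidden layer with made-up weights; a real network would learn these via backpropagation:

```python
def relu(x):
    """Activation function: clamps negative values to zero."""
    return [max(0.0, v) for v in x]

def dense(x, W, b):
    """One fully connected layer: output_j = sum_i x_i * W[i][j] + b[j]."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

# Illustrative weights: 3 inputs -> 2 hidden units -> 1 output.
W1 = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
b1 = [0.0, 0.1]
W2 = [[1.0], [-1.5]]
b2 = [0.2]

x = [1.0, 0.5, -0.5]
hidden = relu(dense(x, W1, b1))
output = dense(hidden, W2, b2)  # add softmax/sigmoid here for probabilities
print(output)
```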

5. Hyper-parameter Tuning

  • Use a development set or k-fold cross-validation to tune hyper-parameters (e.g., tree depth, regularization strength); a sketch follows this list.

  • Regularization prevents overfitting by penalizing model complexity.
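A sketch of k-fold grid search with scikit-learn (an assumed library; the parameter grid values are arbitrary examples):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy feature vectors and labels; any featurized dataset would do.
X = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.8, 0.3], [0.2, 0.7]]
y = [0, 1, 1, 0, 1, 0]

# C is the inverse regularization strength: smaller C = stronger penalty.
grid = GridSearchCV(LogisticRegression(), {"C": [0.01, 0.1, 1.0, 10.0]}, cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```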

6. A Final Word

Many algorithms are available, but annotation quality, dataset size, and feature selection often matter more than the choice of algorithm for good performance.