N-gram Language Models
1. Overview
Language models are used to determine the quality of text and measure how fluent a sentence is. They are essential in applications like speech recognition (e.g., distinguishing "recognise speech" from "wreck a nice beach"), query completion, and text generation (e.g., ChatGPT).
2. Joint Probabilities
The goal is to compute the probability of a sequence of words, $P(w_1, w_2, \ldots, w_m)$. Using the chain rule, this joint probability is decomposed into conditional probabilities:

$$P(w_1, w_2, \ldots, w_m) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2)\cdots P(w_m \mid w_1, \ldots, w_{m-1})$$
3. The Markov Assumption
To make the computation feasible, the Markov assumption approximates the probability of a word using only the previous $n-1$ words:

$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$$
- When $n = 1$ (a unigram model): $P(\text{the dog barks}) \approx P(\text{the})\,P(\text{dog})\,P(\text{barks})$
- When $n = 2$ (a bigram model): $P(\text{the dog barks}) \approx P(\text{the})\,P(\text{dog} \mid \text{the})\,P(\text{barks} \mid \text{dog})$
- When $n = 3$ (a trigram model): $P(\text{the dog barks}) \approx P(\text{the})\,P(\text{dog} \mid \text{the})\,P(\text{barks} \mid \text{the}, \text{dog})$
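A minimal sketch of how unigrams, bigrams, and trigrams are extracted from this example, assuming simple whitespace tokenisation (the `ngrams` helper is an illustrative name, not from the notes):

```python
# Extract n-grams from a token sequence (illustrative sketch).
def ngrams(tokens, n):
    """Return all n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the dog barks".split()
print(ngrams(tokens, 1))  # [('the',), ('dog',), ('barks',)]
print(ngrams(tokens, 2))  # [('the', 'dog'), ('dog', 'barks')]
print(ngrams(tokens, 3))  # [('the', 'dog', 'barks')]
```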
4. Maximum Likelihood Estimation (MLE)
Probabilities are estimated using counts from a corpus:
- Unigram: $P(w_i) = \dfrac{C(w_i)}{M}$ (where $M$ is the total number of word tokens)
- Bigram: $P(w_i \mid w_{i-1}) = \dfrac{C(w_{i-1}, w_i)}{C(w_{i-1})}$
- N-gram: $P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \dfrac{C(w_{i-n+1}, \ldots, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})}$
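As a sketch of how these estimates are computed in practice, the following assumes a tiny made-up corpus and whitespace tokenisation (none of this is prescribed by the notes):

```python
from collections import Counter

corpus = ["the dog barks", "the cat sleeps", "the dog sleeps"]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """MLE estimate: P(word | prev) = C(prev, word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev] if unigram_counts[prev] else 0.0

print(bigram_prob("the", "dog"))    # 2/3
print(bigram_prob("dog", "barks"))  # 1/2
```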
5. Book-ending Sequences
Special tags `<s>` and `</s>` denote the start and end of a sentence, respectively, aiding probability calculations (e.g., $P(\text{the} \mid \langle s\rangle)$ for the first word and $P(\langle/s\rangle \mid \text{barks})$ for the end of the sentence).
6. Smoothing
Smoothing assigns probability to unseen n-grams to avoid zero probabilities. Common techniques include:
- Laplacian (Add-one) Smoothing
- Add-k Smoothing
- Absolute Discounting
- Kneser-Ney Smoothing
6.1 Laplacian (Add-one) Smoothing
Adds 1 to all counts, e.g., for bigrams: $P_{\text{add1}}(w_i \mid w_{i-1}) = \dfrac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + |V|}$, where $|V|$ is the vocabulary size.
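A minimal sketch of add-one smoothing over toy bigram counts; the corpus is made up, and the vocabulary is taken to be the set of observed word types (an assumption, not something fixed by the notes):

```python
from collections import Counter

corpus = ["the dog barks", "the cat sleeps", "the dog sleeps"]
unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))
vocab_size = len(unigram_counts)  # |V|: number of observed word types (5 here)

def bigram_prob_add1(prev, word):
    """Add-one smoothed estimate: (C(prev, word) + 1) / (C(prev) + |V|)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

print(bigram_prob_add1("the", "dog"))    # (2 + 1) / (3 + 5) = 0.375
print(bigram_prob_add1("cat", "barks"))  # unseen bigram gets non-zero probability: 1/6
```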
6.2 Add-k Smoothing
Adds a fraction $k$ ($k < 1$) to all counts, e.g., for bigrams: $P_{\text{addk}}(w_i \mid w_{i-1}) = \dfrac{C(w_{i-1}, w_i) + k}{C(w_{i-1}) + k|V|}$.
6.3 Absolute Discounting
Subtracts a fixed discount $d$ from each observed count and redistributes the subtracted probability mass to unseen n-grams.
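One common interpolated formulation of absolute discounting (stated here as standard background, not quoted from the notes) is:

$$P_{\text{abs}}(w_i \mid w_{i-1}) = \frac{\max\big(C(w_{i-1}, w_i) - d,\ 0\big)}{C(w_{i-1})} + \lambda(w_{i-1})\,P(w_i)$$

where $d$ is the fixed discount and $\lambda(w_{i-1})$ is set so that the probabilities sum to one.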
6.4 Kneser-Ney Smoothing
Uses continuation probabilities based on the versatility of lower-order n-grams, i.e., how many distinct contexts a word appears in rather than its raw frequency.
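The unigram continuation probability at the heart of Kneser-Ney can be written as follows (standard form, with notation assumed here rather than taken from the notes):

$$P_{\text{cont}}(w_i) = \frac{\big|\{ w_{i-1} : C(w_{i-1}, w_i) > 0 \}\big|}{\big|\{ (w_{j-1}, w_j) : C(w_{j-1}, w_j) > 0 \}\big|}$$

i.e., the number of distinct bigram types that end in $w_i$, normalised by the total number of distinct bigram types.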
Text Classification
1. Fundamentals of Classification
Classification involves predicting a class label $c$ from a fixed set of classes $C$ for a given input document $d$, which is often represented as a vector of features. It differs from regression (which predicts continuous outputs) and ranking (which predicts ordinal outputs).
2. Text Classification Tasks
Common text classification tasks include:
- Topic Classification: Categorizing text by topic (e.g., "acquisitions", "earnings").
- Sentiment Analysis: Determining sentiment polarity (e.g., positive, negative, neutral).
- Native-Language Identification: Identifying the author's native language.
- Natural-Language Inference: Determining relationships between sentences (e.g., entailment, contradiction, neutral).
- Automatic Fact-Checking and Paraphrase Detection: Other specialized tasks.

Note: input may vary in length, from long documents to short sentences or tweets.
3. Build a Text Classifier
Steps to build a text classifier (see the pipeline sketch after this list):
- Identify a task of interest
- Collect an appropriate corpus
- Carry out annotation (label the data)
- Select features (e.g., n-grams, word overlap)
- Choose a machine learning algorithm
- Train the model and tune hyper-parameters using held-out development data
- Repeat earlier steps as needed to improve performance
- Train the final model
- Evaluate the model on held-out test data
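A minimal sketch of this pipeline with scikit-learn; the toy data, unigram features, and choice of logistic regression are illustrative assumptions rather than recommendations from the notes:

```python
# End-to-end text-classification sketch: features -> model -> held-out evaluation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

docs = ["great movie, loved it", "terrible plot and acting",
        "what a wonderful film", "boring and far too long"]
labels = ["pos", "neg", "pos", "neg"]

# Hold out test data; in practice a separate development set is also kept for tuning.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.5, stratify=labels, random_state=0)

# Features: unigram counts; learning algorithm: logistic regression.
model = make_pipeline(CountVectorizer(ngram_range=(1, 1)), LogisticRegression())
model.fit(X_train, y_train)

print(accuracy_score(y_test, model.predict(X_test)))
```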
4. Algorithms for Classification
4.1 Naive Bayes
Uses Bayes' rule to find the class with the highest posterior probability; the decision rule is sketched after the pros and cons below.
- Assumes features are independent.
- Pros: Fast to train and classify, robust (low variance), good for low-data situations.
- Cons: Independence assumption rarely holds, lower accuracy in many cases, requires smoothing for unseen features.
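The decision rule in its standard form (notation assumed here, not quoted from the notes):

$$\hat{c} = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(f_i \mid c)$$

where $f_1, \ldots, f_n$ are the document's features (e.g., its words) and $P(c)$ is the class prior.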
4.2 Logistic Regression
A linear model that uses the softmax function to produce class probabilities (see the formula after the pros and cons below).
- Pros: Handles correlated features well, better performance than Naive Bayes in many cases.
- Cons: Slow to train, requires feature scaling, needs regularization to prevent overfitting, data hungry.
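In multiclass form the model can be written as follows (a standard softmax formulation; the notation is assumed here rather than taken from the notes):

$$P(c \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_c \cdot \mathbf{x} + b_c)}{\sum_{c' \in C} \exp(\mathbf{w}_{c'} \cdot \mathbf{x} + b_{c'})}$$

where $\mathbf{x}$ is the feature vector and $\mathbf{w}_c$, $b_c$ are the weight vector and bias for class $c$.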
4.3 Support Vector Machines (SVM)
Finds the hyperplane that separates classes with the maximum margin.
- Pros: Fast, accurate, works well with large feature sets, supports non-linearity via the kernel trick.
- Cons: Multiclass classification is complex.
4.4 K-Nearest Neighbours (KNN)
4.5 Decision Tree
4.6 Random Forest
4.7 Neural Networks
Layers of interconnected nodes (input, hidden, output) with activation functions.
- Pros: Extremely powerful, minimal feature engineering, dominant in NLP.
- Cons: Complex, slow to train, many hyper-parameters, prone to overfitting.
5. Hyper-parameter Tuning
- Use a development set or $k$-fold cross-validation to tune hyper-parameters (e.g., tree depth, regularization strength); a small sketch follows this list.
- Regularization prevents overfitting by penalizing model complexity.
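A minimal sketch of $k$-fold cross-validation for tuning the regularization strength of a logistic-regression text classifier; the toy data, parameter grid, and choice of model are illustrative assumptions:

```python
# Hyper-parameter tuning via k-fold cross-validation (scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

docs = ["great movie", "terrible film", "loved it", "hated it",
        "wonderful acting", "boring plot", "really enjoyable", "utterly dull"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

pipeline = Pipeline([("vec", CountVectorizer()), ("clf", LogisticRegression())])

# Smaller C means stronger regularization in scikit-learn's LogisticRegression.
grid = GridSearchCV(pipeline, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=4)
grid.fit(docs, labels)

print(grid.best_params_)  # best regularization strength found by cross-validation
print(grid.best_score_)   # mean accuracy across the k folds
```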
6. A Final Word
Many algorithms are available, but annotation quality, dataset size, and feature selection often matter more than the choice of algorithm for good performance.