Linear Regression and Logistic Regression

From lec2 to lec5

Linear Regression

Model: $H = a + bT$

  • Criterion: minimise the sum of squared errors

Expressed in matrix form:

$$y = X\beta + \epsilon$$

where $\epsilon \sim \mathcal{N}(0, \sigma^2)$, and thus $y \sim \mathcal{N}(x'w, \sigma^2)$.

Maximum-likelihood estimation: choose the parameter values that maximise the probability of the observed data.

$$p(y_1, \dots, y_n \mid x_1, \dots, x_n) = \prod_{i=1}^n p(y_i \mid x_i)$$

Maximising the log-likelihood as a function of $w$ is equivalent to minimising the sum of squared errors:

$$L = y'y - 2y'Xw + w'X'Xw$$
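
Setting the gradient to zero gives the normal equations:

$$\nabla_w L = -2X'y + 2X'Xw = 0 \;\Rightarrow\; X'Xw = X'y$$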

Solving for $w$ yields:

$$\hat{w} = (X^T X)^{-1} X^T y$$

$X'$ denotes the transpose of $X$ (also written $X^T$).
$X'X$ must be invertible, that is, full rank: the column vectors of $X$ are linearly independent.
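
A minimal numpy sketch of this closed-form solution (the toy data and coefficients below are made up for illustration):

```python
import numpy as np

# Toy data: y = 2 + 3x + Gaussian noise (illustrative values only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 + 3 * x + rng.normal(0, 1, size=50)

# Design matrix with a bias column so the intercept is part of w
X = np.column_stack([np.ones_like(x), x])

# Normal equations: w_hat = (X'X)^{-1} X'y
# (np.linalg.solve is numerically preferable to explicitly inverting X'X)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # roughly [2, 3]
```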

Basis expansion for linear regression: turn linear regression into polynomial regression to better fit the data.
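
A hedged sketch of basis expansion, reusing the same closed-form solution on polynomial features (the degree and data are chosen arbitrarily for illustration):

```python
import numpy as np

def poly_features(x, degree):
    """Basis expansion: map scalar inputs x to [1, x, x^2, ..., x^degree]."""
    return np.column_stack([x**d for d in range(degree + 1)])

# Toy nonlinear data (illustrative only)
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 100)
y = np.sin(x) + rng.normal(0, 0.1, size=x.shape)

Phi = poly_features(x, degree=5)                  # expanded design matrix
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # same normal equations as before
y_pred = Phi @ w_hat                              # polynomial fit to the data
```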

Logistic regression

Logistic regression is a model for binary classification problems.
It assumes $Y$ follows a Bernoulli distribution whose parameter is given by:

$$P(Y=1 \mid \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{x}'w)}$$

MLE for logistic regression has no closed-form (analytical) solution.

We therefore turn to approximate iterative solutions:

  • Stochastic Gradient Descent (SGD)
  • Newton-Raphson Method

Stochastic Gradient Descent (SGD)
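
A minimal SGD sketch for maximising the logistic-regression log-likelihood (the learning rate, epoch count, and function names are my own illustrative choices, not values from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, lr=0.1, epochs=100, seed=0):
    """Stochastic gradient ascent on the log-likelihood.

    X: (n, d) design matrix (add a bias column yourself if needed)
    y: (n,) labels in {0, 1}
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):   # one training example at a time
            p = sigmoid(X[i] @ w)           # predicted P(Y=1 | x_i)
            w += lr * (y[i] - p) * X[i]     # per-example gradient of the log-likelihood
    return w
```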

Newton-Raphson Method

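For reference, the general Newton-Raphson update for maximising the log-likelihood $\ell(w)$ uses second-order (curvature) information:

$$w^{(t+1)} = w^{(t)} - \left[\nabla^2 \ell(w^{(t)})\right]^{-1} \nabla \ell(w^{(t)})$$

For logistic regression, with $\mu_i = \sigma(x_i'w)$ and $S = \mathrm{diag}\big(\mu_i(1-\mu_i)\big)$, this becomes the iteratively reweighted least squares update $w^{(t+1)} = w^{(t)} + (X'SX)^{-1} X'(y - \mu)$. Compared with SGD, each step is more expensive (it inverts the Hessian) but far fewer iterations are needed.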

Regularisation

Recall the normal-equations solution of linear regression:

$$\hat{w} = (X^T X)^{-1} X^T y$$

With irrelevant or multicollinear features, the matrix $X'X$ has no inverse. Regularisation introduces an additional condition into the system:

$$\|y - Xw\|_2^2 + \lambda\|w\|_2^2 \quad \text{for } \lambda > 0$$

This adds $\lambda$ to the eigenvalues of $X'X$, making it invertible: this is ridge regression.
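
The resulting ridge solution has the closed form $\hat{w}_{\text{ridge}} = (X'X + \lambda I)^{-1} X'y$. A small numpy sketch with a deliberately duplicated (perfectly collinear) column; the data and $\lambda$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
X = np.column_stack([np.ones(50), x, x])    # duplicated column: X'X is singular
y = 1 + 2 * x + rng.normal(0, 0.1, size=50)

lam = 0.1
# Plain OLS would fail here: np.linalg.solve(X.T @ X, X.T @ y) hits a singular matrix.
# Ridge adds lambda to the diagonal (shifts the eigenvalues), restoring invertibility.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w_ridge)  # the weight on x is shared between the two identical columns
```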


Lasso ($L_1$ regularisation) encourages solutions to sit on the axes:
some of the weights are set to exactly zero $\rightarrow$ the solution is sparse.
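
For comparison, the Lasso objective swaps the squared penalty for an $\ell_1$ penalty, $\|y - Xw\|_2^2 + \lambda\|w\|_1$, which has no closed-form solution.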
