Bayesian Inference

Could we reason over all parameters that are consistent with the data?

  • Weights with a better fit to the training data should be more probable than others
  • Make predictions with all these weights, scaled by their probability

This is the idea underlying Bayesian inference.

Frequentist vs. Bayesian "divide"

  • Frequentist: learning using point estimates, regularisation, p-values
    • backed by sophisticated theory built on simplifying assumptions
    • mostly simpler algorithms; characterises much practical machine learning research
  • Bayesian: maintain uncertainty, marginalise (sum) out unknowns during inference
    • some supporting theory (not as complete as the frequentist theory)
    • often more complex algorithms, but not always
    • often (not always) more computationally expensive

Bayesian Regression

Apply Bayesian inference to linear regression, using a Normal prior over w.

Bayes' rule:

$$
p(\mathbf{w}|\mathbf{X}, \mathbf{y}) = \frac{p(\mathbf{y}|\mathbf{X}, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y}|\mathbf{X})}
$$

$$
\max_{\mathbf{w}} p(\mathbf{w}|\mathbf{X}, \mathbf{y}) = \max_{\mathbf{w}} p(\mathbf{y}|\mathbf{X}, \mathbf{w})\, p(\mathbf{w})
$$

$p(\mathbf{y}|\mathbf{X}, \mathbf{w})$ is the likelihood function, $p(\mathbf{w})$ is the prior, and $p(\mathbf{y}|\mathbf{X})$ is the marginal likelihood. Since the marginal likelihood does not depend on $\mathbf{w}$, maximising the posterior is equivalent to maximising likelihood × prior.
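
A quick consequence (a sketch, using the Normal likelihood with noise variance $\sigma^2$ and Normal$(\mathbf{0}, \gamma^2 \mathbf{I}_D)$ prior introduced below): taking negative logs and dropping terms that do not depend on $\mathbf{w}$, the MAP estimate is exactly ridge regression,

$$
\max_{\mathbf{w}} p(\mathbf{y}|\mathbf{X}, \mathbf{w})\, p(\mathbf{w})
\;\Longleftrightarrow\;
\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \frac{\sigma^2}{\gamma^2} \|\mathbf{w}\|^2
$$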

Consider the full posterior

$$
p(\mathbf{w}|\mathbf{X}, \mathbf{y}, \sigma^2) = \frac{p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2)\, p(\mathbf{w})}{p(\mathbf{y}|\mathbf{X}, \sigma^2)}
$$

A marginal likelihood is a likelihood function that has been integrated over the parameter space. Due to this integration, the marginal likelihood does not directly depend on the parameters.

Marginal likelihood definition:

$$
p(\mathbf{y}|\mathbf{X}, \sigma^2) = \int p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma^2)\, p(\mathbf{w})\, d\mathbf{w}
$$

Conjugate prior: when the product of likelihood × prior is a distribution of the same family as the prior (e.g. a Normal prior with a Normal likelihood yields a Normal posterior).

Evidence can be computed easily using the normalising constant of the Normal distribution
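
A minimal numerical sketch of this (assumed model: $\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon}$ with $\mathbf{w} \sim \text{Normal}(\mathbf{0}, \gamma^2 \mathbf{I}_D)$ and $\boldsymbol{\epsilon} \sim \text{Normal}(\mathbf{0}, \sigma^2 \mathbf{I}_N)$; the function name `log_evidence` is just illustrative): marginalising $\mathbf{w}$ leaves $\mathbf{y}|\mathbf{X} \sim \text{Normal}(\mathbf{0}, \gamma^2 \mathbf{X}\mathbf{X}^\prime + \sigma^2 \mathbf{I}_N)$, so the evidence is a single Gaussian density evaluation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence(X, y, sigma2, gamma2):
    """Log marginal likelihood p(y | X, sigma^2) for Bayesian linear regression.

    Marginalising w out of Normal(y | Xw, sigma2*I_N) * Normal(w | 0, gamma2*I_D)
    gives y | X ~ Normal(0, gamma2 * X X' + sigma2 * I_N).
    """
    N = X.shape[0]
    cov = gamma2 * (X @ X.T) + sigma2 * np.eye(N)
    return multivariate_normal(mean=np.zeros(N), cov=cov).logpdf(y)
```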

$$
p(\mathbf{w}|\mathbf{X}, \mathbf{y}, \sigma^2) \propto \text{Normal}(\mathbf{w}|\mathbf{0}, \gamma^2 \mathbf{I}_D)\, \text{Normal}(\mathbf{y}|\mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I}_N) \propto \text{Normal}(\mathbf{w}|\mathbf{w}_N, \mathbf{V}_N)
$$

Where:

$$
\mathbf{w}_N = \frac{1}{\sigma^2} \mathbf{V}_N \mathbf{X}^\prime \mathbf{y}, \qquad
\mathbf{V}_N = \sigma^2 \left( \mathbf{X}^\prime \mathbf{X} + \frac{\sigma^2}{\gamma^2} \mathbf{I}_D \right)^{-1}
$$

$\mathbf{w}_N$: posterior mean vector; $\mathbf{V}_N$: posterior covariance matrix.
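
A minimal NumPy sketch of these updates (the helper names `posterior` and `predict` are illustrative; the predictive variance $\mathbf{x}_*^\prime \mathbf{V}_N \mathbf{x}_* + \sigma^2$ is the standard Gaussian posterior predictive result, not derived in this note):

```python
import numpy as np

def posterior(X, y, sigma2, gamma2):
    """Posterior Normal(w | w_N, V_N) for Bayesian linear regression."""
    D = X.shape[1]
    V_N = sigma2 * np.linalg.inv(X.T @ X + (sigma2 / gamma2) * np.eye(D))
    w_N = V_N @ X.T @ y / sigma2
    return w_N, V_N

def predict(x_new, w_N, V_N, sigma2):
    """Posterior predictive mean and variance at a single input x_new."""
    mean = x_new @ w_N
    var = x_new @ V_N @ x_new + sigma2  # parameter uncertainty + observation noise
    return mean, var

# Toy usage: recover a known weight vector from noisy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.3, size=50)
w_N, V_N = posterior(X, y, sigma2=0.09, gamma2=1.0)
```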
