Bias-variance trade-off

  • simple model $\rightarrow$ low variance, high bias
  • complex model $\rightarrow$ high variance, low bias
[Figure: example fits showing high variance vs. high bias]
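
A minimal simulation of this trade-off, assuming numpy; the sine target, noise level, and polynomial degrees are illustrative choices, not from the notes. Fitting a degree-1 and a degree-9 polynomial to many resampled noisy datasets lets us estimate each model's squared bias and variance on fixed test inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)   # ground-truth regression function
x_test = np.linspace(0, 1, 30)             # fixed test inputs

def predictions(degree, n_datasets=200, n_points=30, noise=0.3):
    """Fit a polynomial of the given degree to many resampled noisy
    datasets and collect its predictions on the fixed test inputs."""
    preds = []
    for _ in range(n_datasets):
        xs = rng.uniform(0, 1, n_points)
        ys = true_f(xs) + rng.normal(0, noise, n_points)
        preds.append(np.polyval(np.polyfit(xs, ys, degree), x_test))
    return np.array(preds)

for degree in (1, 9):                      # simple vs complex model
    p = predictions(degree)
    bias2 = np.mean((p.mean(axis=0) - true_f(x_test)) ** 2)
    var = np.mean(p.var(axis=0))
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {var:.3f}")
```

The simple model should report large bias and small variance, the complex model the reverse.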

Testing and training error

[Figure: testing and training error]

Empirical Risk and True Risk

The true risk is the expected value of the loss $l$:

$$R[f] = \mathbb{E}[l(Y, f(X))] = \int l(Y, f(X)) \, P(X, Y) \, dX \, dY$$

  • $l(Y, f(X))$: loss function
  • $P(X, Y)$: joint distribution of input $X$ and target $Y$

The empirical risk is an estimate of the true risk: the average of the loss over a finite sample of $n$ observations,

$$\hat{R}_D[f] = \frac{1}{n} \sum_{i=1}^{n} l(y_i, f(x_i))$$

  • as sample size $n \rightarrow \infty$, the empirical risk converges to the true risk (see the sketch below)
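
A quick numerical check of this convergence, assuming numpy; the linear data model, the fixed predictor $f(x) = 2x$, and the squared loss are illustrative choices. Under $Y = 2X + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 0.5^2)$, the true risk of $f$ is exactly $\mathrm{Var}(\varepsilon) = 0.25$, so the sample average should approach $0.25$ as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: 2.0 * x                       # fixed predictor under evaluation
loss = lambda y, yhat: (y - yhat) ** 2      # squared loss

# Data model: Y = 2X + eps with eps ~ N(0, 0.5^2), so the true risk of f
# is E[(Y - f(X))^2] = Var(eps) = 0.25.
for n in (10, 1_000, 100_000):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, 0.5, n)
    print(f"n = {n:>6}: empirical risk = {loss(y, f(x)).mean():.4f}")
```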

Structural Risk Minimisation

With probability at least $1 - \delta$:

$$R[\hat{f}_D] \leq \hat{R}_D[\hat{f}_D] + \sqrt{\frac{\log |\mathcal{F}| + \log(1/\delta)}{2n}}$$
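
A small sketch of how the complexity term of this bound behaves; the hypothesis-class size $|\mathcal{F}| = 1000$ and $\delta = 0.05$ are made-up values. The term shrinks as $n$ grows and depends only logarithmically on $|\mathcal{F}|$:

```python
import math

def finite_class_term(n, card_F, delta):
    """Complexity term sqrt((log|F| + log(1/delta)) / (2n))."""
    return math.sqrt((math.log(card_F) + math.log(1 / delta)) / (2 * n))

# Hypothetical |F| = 1000 and delta = 0.05 (95% confidence):
for n in (100, 1_000, 10_000):
    print(f"n = {n:>5}: term = {finite_class_term(n, 1000, 0.05):.3f}")
```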

VC dimension

$\mathcal{F}$ **shatters** $\{x_1, x_2\}$: possible labelings

  1. (0, 0)
  2. (0, 1)
  3. (1, 0)
  4. (1, 1)

If any of the 4 labelings is missing, $\mathcal{F}$ **does not shatter** $\{x_1, x_2\}$.

$VC(\mathcal{F})$ = the size of the largest set that $\mathcal{F}$ shatters.

e.g. if $\mathcal{F}$ shatters $\{x_1, x_2, x_4\}$ (all $2^3 = 8$ labelings appear as unique rows) and shatters no set of 4 points, then $VC(\mathcal{F}) = 3$ (see the sketch below).
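
A brute-force shattering check, as a minimal sketch; the hypothesis class of 1-D threshold classifiers $f_t(x) = \mathbb{1}[x > t]$ is an illustrative choice, not from the notes. It enumerates the distinct labelings (unique rows) the class produces on a point set and compares the count against $2^n$:

```python
import numpy as np

def achieved_labelings(points, thresholds):
    """Distinct labelings ("unique rows") that threshold classifiers
    f_t(x) = 1[x > t] produce on the given points."""
    return {tuple(int(x > t) for x in points) for t in thresholds}

def shatters(points, thresholds):
    # A set is shattered iff all 2^n labelings are achieved.
    return len(achieved_labelings(points, thresholds)) == 2 ** len(points)

ts = np.linspace(-1.0, 2.0, 301)  # dense grid standing in for all thresholds
print(shatters([0.5], ts))        # True: any single point is shattered
print(shatters([0.3, 0.7], ts))   # False: (1, 0) is unreachable, so VC = 1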

**Sauer-Shelah Lemma**: the upper bound on the growth function $S_\mathcal{F}(n)$ given finite $VC(\mathcal{F})$:

$$S_\mathcal{F}(n) \leq \sum_{i=0}^{VC(\mathcal{F})} \binom{n}{i}$$

Since $\sum_{i=0}^{k} \binom{n}{i} \leq (n+1)^{k}$, the above implies:

$$\log S_\mathcal{F}(n) \leq VC(\mathcal{F}) \log(n+1)$$
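
The lemma is easy to evaluate numerically; a short sketch with an assumed $VC(\mathcal{F}) = 3$, showing that the growth function is only polynomial in $n$ while the unrestricted labeling count $2^n$ explodes:

```python
from math import comb

def sauer_bound(n, vc):
    """Sauer-Shelah upper bound on the growth function S_F(n)."""
    return sum(comb(n, i) for i in range(vc + 1))

# With VC(F) = 3 the growth function is polynomial in n, far below 2^n:
for n in (5, 10, 20):
    print(f"n = {n:>2}: bound = {sauer_bound(n, 3):>5}, 2^n = {2 ** n}")
```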

Structural Risk Minimisation (VC Generalisation Theorem)

With probability at least $1 - \delta$:

$$R[\hat{f}_D] \leq \hat{R}_D[\hat{f}_D] + \sqrt{8 \, \frac{VC(\mathcal{F}) \log(2n + 1) + \log(4/\delta)}{n}}$$
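
As a sketch of how this bound scales (the $n$, $VC(\mathcal{F})$, and $\delta = 0.05$ values are illustrative), the complexity term can be computed directly:

```python
import math

def vc_term(n, vc, delta):
    """Complexity term sqrt(8 * (VC * log(2n + 1) + log(4/delta)) / n)."""
    return math.sqrt(8 * (vc * math.log(2 * n + 1) + math.log(4 / delta)) / n)

# Illustrative values: the term shrinks with n and grows with VC(F).
for n in (1_000, 100_000):
    for vc in (3, 10):
        print(f"n = {n:>6}, VC = {vc:>2}: term = {vc_term(n, vc, 0.05):.3f}")
```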
