Skip to main content

2. Bias-variance trade-off

Bias-variance trade-off

  • simple model \rightarrow low variance, high bias
  • complex model \rightarrow high variance, low bias
alt textalt text
high variancehigh bias

Testing and trainning error alt text

Empirical Risk and True Risk

The true risk is the expected value of the loss 𝑙

R[f]=E[l(Y,f(X))]=l(Y,f(X))P(X,Y)dXdYR[f] = \mathbb{E}[l(Y, f(X))] = \int l(Y, f(X)) P(X, Y) dX dY
  • l(Y,f(X))l(Y, f(X)): Loss function
  • P(X,y)P(X,y) joint distribution of input X and target Y

The empirical risk is an estimate (an average of a finite number of samples 𝑛) of the expected value

  • sample size \rightarrow \infty, the true risk is the empirical risk

Structural Risk Minimisation

R^D[f^D]+logF+log(1/δ)2n\hat{R}_D[\hat{f}_D] + \sqrt{\frac{\log |\mathcal{F}| + \log(1/\delta)}{2n}}

VC dimension

F shatters {x1,x2}\mathcal{F} \ \textbf{shatters} \ \{ x_1, x_2 \} : Possible Labelings

  1. (0, 0)
  2. (0, 1)
  3. (1, 0)
  4. (1, 1)

If 1 of the 4 labels missing, F does not shatter {x1,x2}\mathcal{F} \ \textbf{does not shatter} \ \{ x_1, x_2 \}

The number of maximun shattering sets = VC(F)VC(\mathcal{F})

if F\mathcal{F} shatters {x1,x2,x3}\{x_1, x_2, x_3\} and unique rows: 23=82^3 = 8, VC(F)=3VC(\mathcal{F}) = 3

Sauer-Shelah Lemma The Upper bound of growth function SF(n)S_\mathcal{F}(n) given finite VC(F)VC(\mathcal{F}):

SF(n)i=0VC(F)(ni)S_\mathcal{F}(n) \leq \sum_{i=0}^{VC(\mathcal{F})} \binom{n}{i}

Since i=0k(ni)(n+1)k\sum_{i=0}^{k} \binom{n}{i} \leq (n+1)^{k}, the above implies:

logSF(n)VC(F)log(n+1)\log S_\mathcal{F}(n) \leq VC(\mathcal{F}) \log(n+1)

Structural Risk Minimisation(VC)(VC Generalisation Theorem)

R^D[f^D]+8VC(F)log(2n+1)+log(4/δ)n\hat{R}_D[\hat{f}_D] + \sqrt{8\frac{VC(\mathcal{F})\log(2n + 1) + \log(4/\delta)}{n}}