
5. Cross validation


Cross-validation compares different algorithms' performance using limited available data and helps choose hyperparameters for an algorithm.

Performance metrics

  • Given a dataset $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of $n$ samples.
  • For a data point $x_i$ we predict $g(x_i)$.

Metrics in regression

  • Mean squared error: $MSE(g) = \frac{1}{n} \sum_{i = 1}^{n}(g(x_i)-y_i)^2$
  • Root mean squared error: $RMSE(g) = \sqrt{MSE(g)}$
  • Mean absolute error: $\frac{1}{n}\sum_{i=1}^{n} |g(x_i)-y_i|$
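
As a quick illustration, here is a minimal NumPy sketch of these three regression metrics (the function name and interface are illustrative, not part of the notes):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, and MAE for predictions g(x_i) against labels y_i."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mse = np.mean((y_pred - y_true) ** 2)     # mean squared error
    rmse = np.sqrt(mse)                       # root mean squared error
    mae = np.mean(np.abs(y_pred - y_true))    # mean absolute error
    return mse, rmse, mae
```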

Performance metrics in classification

|              | True Label +1 | True Label -1 |
| ------------ | ------------- | ------------- |
| Predicted +1 | TP            | FP            |
| Predicted -1 | FN            | TN            |
  • Accuracy: $(TP + TN) / (TP + FP + FN + TN)$
  • Error: $(FP + FN) / (TP + FP + FN + TN)$
  • Recall/Sensitivity: $TP / (TP + FN)$
  • Precision: $TP / (TP + FP)$
  • Specificity: $TN / (TN + FP)$
  • F1-score: $2 \cdot Precision \cdot Recall / (Precision + Recall)$
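
A small sketch computing these metrics from the confusion-matrix counts (binary case; the helper name is illustrative):

```python
def classification_metrics(tp, fp, fn, tn):
    """Binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy    = (tp + tn) / total
    error       = (fp + fn) / total
    recall      = tp / (tp + fn)              # a.k.a. sensitivity
    precision   = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f1          = 2 * precision * recall / (precision + recall)
    return dict(accuracy=accuracy, error=error, recall=recall,
                precision=precision, specificity=specificity, f1=f1)
```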

Receiver Operating Characteristic (ROC)

The ROC curve plots the true-positive rate against the false-positive rate; it is summarized with an Area Under the Curve (AUC):

  • Random: 0.5
  • Perfect Classifier: 1

Modified loss function

Different misclassifications have different influence.

Predicting a diseased patient as healthy has much more serious consequences than predicting a healthy patient as diseased!


Cross-validation: using “unseen” data

Split the dataset into three parts: a training set, a validation set, and a test set.

K-Fold Cross Validation

  • Split training data $D$ into $k$ disjoint sets $S_1, \ldots, S_k$
    • Either randomly, or in a fixed fashion
    • If $D$ has $n$ samples, then each fold has approximately $\frac{n}{k}$ samples
    • Popular choices: $k = 5$, $k = 10$, $k = n$ (leave one out)
  • For $i = 1, \ldots, k$:
    • train with sets $S_1, \ldots, S_{i-1}, S_{i+1}, \ldots, S_k$
    • test on set $S_i$
    • let $M_i$ be the test metric (e.g., accuracy, MSE)

Mean $\hat{\mu} = \frac{1}{k}\sum_{i = 1}^{k} M_i$, variance $\hat{\sigma}^2 = \frac{1}{k}\sum_{i = 1}^{k} (M_i - \hat{\mu})^2$
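
A minimal sketch of k-fold cross-validation, assuming a hypothetical user-supplied `train_and_score(X_tr, y_tr, X_te, y_te)` helper that fits a model and returns a scalar metric (X and y are NumPy arrays):

```python
import numpy as np

def k_fold_cv(X, y, train_and_score, k=5, shuffle=True, seed=0):
    """k-fold cross-validation: returns mean and variance of the fold metrics."""
    idx = np.arange(len(y))
    if shuffle:
        np.random.default_rng(seed).shuffle(idx)
    folds = np.array_split(idx, k)                    # k disjoint sets S_1, ..., S_k
    metrics = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        metrics.append(train_and_score(X[train_idx], y[train_idx],
                                       X[test_idx], y[test_idx]))
    metrics = np.asarray(metrics)
    return metrics.mean(), metrics.var()
```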

0.632 Bootstrapping

Let $B > 0$, and let $n$ be the number of training samples in $D$.

  • For $i = 1, \ldots, B$:
    • Pick $n$ samples from $D$ with replacement, call it $S_i$ ($S_i$ might contain the same sample more than once)
    • train with set $S_i$
    • test on the remaining samples $(D - S_i)$
    • let $M_i$ be the test metric (e.g., accuracy, MSE)

Mean $\hat{\mu} = \frac{1}{B}\sum_{i = 1}^{B} M_i$, variance $\hat{\sigma}^2 = \frac{1}{B}\sum_{i = 1}^{B} (M_i - \hat{\mu})^2$

  • The probability of not picking one particular item in 1 draw is $1 - 1/n$
  • The probability of not picking one particular item in $n$ draws is $(1 - 1/n)^n$
  • The probability of picking one particular item in $n$ draws is $1 - (1 - 1/n)^n$

Finally: $\lim_{n\to\infty} 1-(1-1/n)^n = 1-1/e \approx 0.632$
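
So each bootstrap training set contains roughly 63.2% of the distinct samples in $D$, and the left-out samples serve as the test set. A minimal sketch, again assuming the hypothetical `train_and_score` helper from the k-fold sketch:

```python
import numpy as np

def bootstrap_632(X, y, train_and_score, B=100, seed=0):
    """0.632 bootstrap: B rounds of drawing n samples with replacement,
    training on the bootstrap sample and testing on the left-out points."""
    rng = np.random.default_rng(seed)
    n = len(y)
    metrics = []
    for _ in range(B):
        boot = rng.integers(0, n, size=n)             # S_i: n draws with replacement
        out = np.setdiff1d(np.arange(n), boot)        # D - S_i: unseen samples (~36.8%)
        metrics.append(train_and_score(X[boot], y[boot], X[out], y[out]))
    metrics = np.asarray(metrics)
    return metrics.mean(), metrics.var()
```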

Nested Cross-validation

  • Useful for hyperparameter tuning
  • Inner cross-validation (e.g., 0.632 bootstrapping) to try different hyperparameters
  • Outer cross-validation (e.g., k-fold) to report the metric using the best hyperparameter (see the sketch below)
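
A compact sketch of nested cross-validation with k-fold loops on both levels. Here `make_scorer(param)` is a hypothetical factory that returns a `train_and_score(X_tr, y_tr, X_te, y_te)` function for a model with that hyperparameter, and higher scores are assumed to be better:

```python
import numpy as np

def nested_cv(X, y, candidate_params, make_scorer, k_outer=5, k_inner=3, seed=0):
    """Nested CV: inner loop selects a hyperparameter per outer fold,
    outer loop reports the metric on held-out data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    outer_folds = np.array_split(idx, k_outer)
    outer_metrics = []
    for i in range(k_outer):
        test_idx = outer_folds[i]
        train_idx = np.concatenate([outer_folds[j] for j in range(k_outer) if j != i])
        # Inner CV on the outer-training data only, to select the hyperparameter.
        inner_folds = np.array_split(train_idx, k_inner)
        best_param, best_score = None, -np.inf
        for param in candidate_params:
            scores = []
            for m in range(k_inner):
                val = inner_folds[m]
                tr = np.concatenate([inner_folds[l] for l in range(k_inner) if l != m])
                scores.append(make_scorer(param)(X[tr], y[tr], X[val], y[val]))
            if np.mean(scores) > best_score:
                best_param, best_score = param, np.mean(scores)
        # Retrain with the best hyperparameter, report on the held-out outer fold.
        outer_metrics.append(make_scorer(best_param)(X[train_idx], y[train_idx],
                                                     X[test_idx], y[test_idx]))
    return np.mean(outer_metrics)
```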

Hypothesis testing

  • Let $\hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}_1^2, \hat{\sigma}_2^2$ be the mean and variance of algorithms 1 and 2

Let $x = \frac{(\hat{\mu}_1-\hat{\mu}_2)\sqrt{n}}{\sqrt{\hat{\sigma}_1^2+\hat{\sigma}_2^2}} \qquad \nu = \left\lfloor\frac{(\hat{\sigma}_1^2+\hat{\sigma}_2^2)^2(n-1)}{\hat{\sigma}_1^4+\hat{\sigma}_2^4}\right\rfloor$

Compare $x$ against a Student's t-distribution with $\nu$ (rounded down) degrees of freedom.
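
A small sketch of this test using SciPy's Student's t-distribution (function name and interface are illustrative; means and variances are taken over $n$ CV folds):

```python
import numpy as np
from scipy import stats

def compare_algorithms(mu1, var1, mu2, var2, n):
    """Test whether algorithms 1 and 2 differ, using the statistic x and
    degrees of freedom nu defined above; returns a two-sided p-value."""
    x = (mu1 - mu2) * np.sqrt(n) / np.sqrt(var1 + var2)
    nu = int(np.floor((var1 + var2) ** 2 * (n - 1) / (var1 ** 2 + var2 ** 2)))
    p_value = 2 * stats.t.sf(abs(x), df=nu)   # tail probability of Student's t
    return x, nu, p_value
```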


Learning with expert advice

In machine learning, the term "expert" typically refers to a model or algorithm with specialized knowledge or predictive ability for a specific task or domain.

Learning from expert advice

  • Learner listens to some / all experts making decisions
  • True outcomes are ADVERSARIAL
  • Learner updates weights over experts based on their losses
  • The algorithms are all forms of "multiplicative weights"
  • Nice clean bounds on total mistakes/losses via the "potential function" technique

Adversarial

  • The real outcomes do not come from a fixed probability distribution; rather, they may be chosen by an adversary attempting to make the learner fail. This is a conservative assumption that ensures the algorithm performs well even in worst-case scenarios. Under this setting, algorithms must handle the most unfavorable situations.

Potential functions and clean upper bounds

  • Using potential functions, we can derive explicit upper bounds on the algorithm's cumulative number of mistakes or total loss.

Infallible expert (one always perfect)

  • Majority Vote Algorithm

Imperfect experts (none guaranteed perfect), with increasingly better algorithms:

  • Weighted Majority Vote Algorithm by Halving
  • Weighted Majority Voting by General Multiplicative Weights
  • Probabilistic Experts Algorithm

Infallible Expert Algorithm: Majority Vote

  1. Initialize the set $E$ of experts who haven't made any mistakes
  2. Repeat per round
    • Observe predictions $E_i$ for all $i \in \{1, \ldots, n\}$
    • Make majority prediction $\arg\max_{y \in \{-1,1\}} \sum_{i \in E} 1[E_i = y]$
    • Observe correct outcome
    • Remove mistaken experts from $E$

Mistake bound for Majority Vote: under the infallible-expert assumption, majority vote makes at most $M \leq \log_2 n$ total mistakes.
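
A minimal sketch of the majority vote algorithm over ±1 predictions (the input format is illustrative):

```python
def majority_vote(expert_predictions, outcomes):
    """Infallible-expert setting: keep only experts that have never been wrong.

    expert_predictions[t][i] is expert i's prediction (+1/-1) in round t,
    outcomes[t] is the true label in round t. Returns the learner's mistake count.
    """
    n = len(expert_predictions[0])
    alive = set(range(n))                              # experts with no mistakes so far
    mistakes = 0
    for preds, y in zip(expert_predictions, outcomes):
        votes = sum(preds[i] for i in alive)
        guess = 1 if votes >= 0 else -1                # majority prediction (ties -> +1)
        mistakes += (guess != y)
        alive = {i for i in alive if preds[i] == y}    # drop mistaken experts
    return mistakes
```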

Imperfect experts and the Halving Algorithm

No more guarantee of an infallible expert

"Zero tolerance" (dropping experts after a single mistake) no longer makes sense.

Imperfect experts: Halving Algorithm

  1. Initialize $w_i = 1$ as the weight of expert $E_i$
  2. Repeat per round
    • Observe predictions $E_i$ for all $i \in \{1, \ldots, n\}$
    • Make weighted majority prediction $\arg\max_{y \in \{-1,1\}} \sum_{i \in E} w_i 1[E_i = y]$
    • Observe correct outcome
    • Downweight each mistaken expert $E_i$: $w_i \leftarrow w_i / 2$

Mistake bound for Halving:
If the best expert makes $m$ mistakes, then weighted majority vote makes $M \leq 2.4(m + \log_2 n)$ mistakes.

From Halving to multiplying weights by $1 - \epsilon$

  1. Initialize $w_i = 1$ as the weight of expert $E_i$
  2. Repeat per round
    • Observe predictions $E_i$ for all $i \in \{1, \ldots, n\}$
    • Make weighted majority prediction $\arg\max_{y \in \{-1,1\}} \sum_{i \in E} w_i 1[E_i = y]$
    • Observe correct outcome
    • Downweight each mistaken expert $E_i$: $w_i \leftarrow w_i (1 - \epsilon)$

Mistake bound for $1 - \epsilon$:
If the best expert makes $m$ mistakes, then the $(1-\epsilon)$ weighted majority vote makes $M \leq 2(1+\epsilon)m + (2\log_e n)/\epsilon$ mistakes.
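
A sketch of the multiplicative-weights version; setting `eps=0.5` recovers the Halving algorithm above (same illustrative input format as the majority-vote sketch):

```python
def weighted_majority(expert_predictions, outcomes, eps=0.5):
    """Weighted majority with multiplicative weights; eps=0.5 recovers Halving."""
    n = len(expert_predictions[0])
    w = [1.0] * n                                                  # w_i = 1 initially
    mistakes = 0
    for preds, y in zip(expert_predictions, outcomes):
        score = sum(wi if p == 1 else -wi for wi, p in zip(w, preds))
        guess = 1 if score >= 0 else -1                            # weighted majority vote
        mistakes += (guess != y)
        w = [wi * (1 - eps) if p != y else wi for wi, p in zip(w, preds)]
    return mistakes
```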

Mini Conclusion

In the case of imperfect experts, the number of mistakes made by the weighted majority algorithm depends on the number of mistakes $m$ made by the best expert, and the factor of 2 in this dependence is unavoidable for deterministic algorithms.

The probabilistic experts algorithm

  • Change 1: move from mistakes to losses: loss $\ell_i^{(t)} \in [0,1]$ of expert $E_i$
  • Change 2: the algorithm is randomised, which means we bound expected losses
  1. Initialize $w_i = 1$ as the weight of expert $E_i$
  2. Repeat per round
    • Observe predictions $E_i$ for all $i \in \{1, \ldots, n\}$
    • Predict as expert $i$ with probability $\frac{w_i}{W}$, where $W = \sum_{j=1}^{n} w_j$
    • Observe losses
    • Update each weight: $w_i \leftarrow (1-\epsilon)^{\ell_i^{(t)}} w_i$

Expected loss bound:
The expected loss of the probabilistic experts algorithm is $L \leq \frac{\log n}{\epsilon} + (1 + \epsilon)L^*$, where $L^*$ is the minimum loss over the experts.
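
A sketch of the probabilistic experts algorithm; note the bound above is on the expected loss, while this simulation returns one realised loss (the input format is illustrative):

```python
import numpy as np

def probabilistic_experts(losses, eps=0.1, seed=0):
    """Sample an expert proportionally to its weight each round, then
    downweight every expert by (1 - eps) ** loss.

    losses[t][i] is the loss in [0, 1] of expert i at round t.
    Returns the learner's total realised loss."""
    rng = np.random.default_rng(seed)
    losses = np.asarray(losses, dtype=float)
    n = losses.shape[1]
    w = np.ones(n)
    total = 0.0
    for l in losses:
        p = w / w.sum()                     # probability w_i / W of following expert i
        i = rng.choice(n, p=p)
        total += l[i]
        w = w * (1 - eps) ** l              # multiplicative update per expert
    return total
```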

MAB

A setting where we learn to take actions and receive only indirect supervision in the form of rewards.

Multi-Armed Bandit

  • Simplest setting for balancing exploration and exploitation
  • Same family of ML tasks as reinforcement learning

Stochastic MAB setting

Possible actions $\{1, \ldots, k\}$ are called "arms"

  • Arm $i$ has distribution $P_i$ on bounded rewards with mean $\mu_i$

In round $t = 1, \ldots, T$:

  • Play action $i_t \in \{1, \ldots, k\}$
  • Receive reward $R_{i_t}(t) \sim P_{i_t}$

Goal: minimize cumulative regret

  • $\mu^* T - \sum_{t=1}^{T} E[R_{i_t}(t)]$, where $\mu^* = \max_i \mu_i$

Greedy

At round t,

  • Estimate the value of each arm $i$ as the average reward observed:

$Q_{t-1}(i) = \begin{cases} \frac{\sum_{s=1}^{t-1} R_i(s)\, \mathbb{1}[i_s = i]}{\sum_{s=1}^{t-1} \mathbb{1}[i_s = i]}, & \text{if } \sum_{s=1}^{t-1} \mathbb{1}[i_s = i] > 0 \\ Q_0, & \text{otherwise} \end{cases}$

Some initial constant $Q_0(i) = Q_0$ is used until arm $i$ has been pulled.

Exploit! $i_t \in \arg\max_{1\leq i \leq k} Q_{t-1}(i)$

ϵ\epsilon - Greedy

At round t

  • Estimate the value of each arm $i$ as the average reward observed (the Q-value function is the same as in greedy):

$Q_{t-1}(i) = \begin{cases} \frac{\sum_{s=1}^{t-1} R_i(s)\, \mathbb{1}[i_s = i]}{\sum_{s=1}^{t-1} \mathbb{1}[i_s = i]}, & \text{if } \sum_{s=1}^{t-1} \mathbb{1}[i_s = i] > 0 \\ Q_0, & \text{otherwise} \end{cases}$

Exploit or explore:

$i_t \sim \begin{cases} \arg\max_{1 \leq i \leq k} Q_{t-1}(i), & \text{with probability } 1 - \epsilon \\ \text{Unif}(\{1, \ldots, k\}), & \text{with probability } \epsilon \end{cases}$

$\epsilon$ controls exploration vs. exploitation:

  • Larger $\epsilon$ increases the chance of exploration
  • Smaller $\epsilon$ favours exploitation (see the sketch after this list)
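
A minimal sketch of $\epsilon$-greedy; setting `eps=0` recovers the pure greedy strategy. Here `reward_fns[i]()` is a hypothetical callable that draws a bounded reward from arm $i$:

```python
import numpy as np

def epsilon_greedy(reward_fns, T=1000, eps=0.1, Q0=0.0, seed=0):
    """epsilon-greedy bandit: explore uniformly with probability eps,
    otherwise exploit the current value estimates."""
    rng = np.random.default_rng(seed)
    k = len(reward_fns)
    Q = np.full(k, Q0)                    # value estimates, Q0 until an arm is pulled
    N = np.zeros(k)                       # pull counts
    total = 0.0
    for t in range(T):
        if rng.random() < eps:
            i = int(rng.integers(k))      # explore: uniform random arm
        else:
            i = int(np.argmax(Q))         # exploit: best current estimate
        r = reward_fns[i]()
        N[i] += 1
        Q[i] += (r - Q[i]) / N[i]         # incremental average reward
        total += r
    return total
```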

Upper-Confidence Bound (UCB)

$Q_{t-1}(i) = \begin{cases} \hat{\mu}_{t-1}(i) + \sqrt{\frac{2 \log t}{N_{t-1}(i)}}, & \text{if } \sum_{s=1}^{t-1} \mathbb{1}[i_s = i] > 0 \\ Q_0, & \text{otherwise} \end{cases}$

Some initial constant $Q_0(i) = Q_0$ is used until arm $i$ has been pulled, where

$N_{t-1}(i) = \sum_{s=1}^{t-1} \mathbb{1}[i_s = i] \qquad \hat{\mu}_{t-1}(i) = \frac{\sum_{s=1}^{t-1} R_i(s)\, \mathbb{1}[i_s = i]}{\sum_{s=1}^{t-1} \mathbb{1}[i_s = i]}$
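
A matching UCB sketch, using the same hypothetical `reward_fns` interface as the $\epsilon$-greedy sketch; each arm is pulled once first, which plays the role of $Q_0$:

```python
import numpy as np

def ucb(reward_fns, T=1000):
    """Upper-Confidence Bound: pick the arm maximising
    empirical mean + sqrt(2 log t / N(i))."""
    k = len(reward_fns)
    Q = np.zeros(k)                       # empirical mean reward per arm
    N = np.zeros(k)                       # pull counts
    total = 0.0
    for t in range(1, T + 1):
        if t <= k:
            i = t - 1                     # pull each arm once before using the bound
        else:
            i = int(np.argmax(Q + np.sqrt(2 * np.log(t) / N)))
        r = reward_fns[i]()
        N[i] += 1
        Q[i] += (r - Q[i]) / N[i]         # incremental average reward
        total += r
    return total
```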