
Cross validation

Cross-validation lets us compare the performance of different algorithms using limited available data, and helps choose hyperparameters for an algorithm.

Performance metrics

  • Given a dataset $D = \{x_1, y_1, \ldots, x_n, y_n\}$ of $n$ samples.
  • For a data point $x_i$ we predict $g(x_i)$.

Metrics in regression

  • Mean squared error: $MSE(g) = \frac{1}{n} \sum_{i=1}^{n}(g(x_i)-y_i)^2$
  • Root mean squared error: $RMSE(g) = \sqrt{MSE(g)}$
  • Mean absolute error: $MAE(g) = \frac{1}{n}\sum_{i=1}^{n} |g(x_i)-y_i|$
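
As a quick reference, here is a minimal Python sketch computing these three regression metrics from paired lists of targets and predictions; the function name and example values are only illustrative.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE and MAE for paired targets and predictions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae}

print(regression_metrics([1.0, 2.0, 3.0], [1.1, 1.9, 3.4]))
```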

Performance metrics in classification

|              | True Label +1 | True Label -1 |
|--------------|---------------|---------------|
| Predicted +1 | TP            | FP            |
| Predicted -1 | FN            | TN            |
  • Accuracy: (TP + TN) / (TP + FP + FN + TN)
  • Error: (FP + FN) / (TP + FP + FN + TN)
  • Recall/Sensitivity: TP / (TP + FN)
  • Precision: TP / (TP + FP)
  • Specificity: TN / (TN + FP)
  • F1-score: 2 × Precision × Recall / (Precision + Recall)
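
A small sketch (a hypothetical helper, not from the lecture) that derives all of these metrics directly from the four confusion-matrix counts:

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive standard classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                    # a.k.a. sensitivity
    return {
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
        "recall": recall,
        "precision": precision,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

print(classification_metrics(tp=40, fp=10, fn=5, tn=45))
```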

Receiver Operating Characteristic (ROC)

The ROC curve plots the true positive rate (recall) against the false positive rate as the decision threshold is varied. It is summarized with an Area Under the Curve (AUC):

  • Random: 0.5
  • Perfect Classifier: 1
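
A rough sketch of how the ROC points are produced: sweep a threshold over the classifier's scores and record the false/true positive rates at each threshold (AUC is the area under the resulting curve). The function name and example scores are illustrative only.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points by sweeping a decision threshold over the scores.

    Labels are +1 / -1; a sample is predicted +1 when its score >= threshold.
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == -1)
        points.append((fp / neg, tp / pos))
    return points

print(roc_points([0.9, 0.7, 0.4, 0.2], [1, -1, 1, -1]))
```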

Modified loss function

Different misclassifications have different consequences.

Predicting a diseased patient as healthy has much more serious consequences than predicting a healthy patient as diseased!


Cross-validation: using “unseen” data

Split the dataset into three parts:

  • Training Set
  • Validation Set
  • Test Set

K-Fold Cross Validation

  • Split training data $D$ into $k$ disjoint sets $S_1, \ldots, S_k$
    • Either randomly, or in a fixed fashion
    • If $D$ has $n$ samples, then each fold has approximately $\frac{n}{k}$ samples
    • Popular choices: $k = 5$, $k = 10$, $k = n$ (leave one out)
  • For $i = 1 \ldots k$
    • train with sets $S_1, \ldots, S_{i-1}, S_{i+1}, \ldots, S_k$
    • test on set $S_i$
    • let $M_i$ be the test metric (e.g., accuracy, MSE)

Mean $\hat{\mu} = \frac{1}{k}\sum_{i=1}^{k} M_i$, variance $\hat{\sigma}^2 = \frac{1}{k}\sum_{i=1}^{k} (M_i - \hat{\mu})^2$
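
A minimal sketch of the procedure, assuming you supply a `train_fn` that fits a model and a `metric_fn` that scores it (both are placeholders, not a specific library API):

```python
import random

def k_fold_cv(data, k, train_fn, metric_fn, seed=0):
    """k-fold cross validation; returns the mean and variance of the fold metrics."""
    data = list(data)
    random.Random(seed).shuffle(data)                # random split into k folds
    folds = [data[i::k] for i in range(k)]
    metrics = []
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train)                      # train on the other k-1 folds
        metrics.append(metric_fn(model, folds[i]))   # test on fold S_i
    mean = sum(metrics) / k
    var = sum((m - mean) ** 2 for m in metrics) / k
    return mean, var
```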

0.632 Bootstrapping

Let $B > 0$, and $n$ be the number of training samples in $D$

  • For $i = 1 \ldots B$:
    • Pick $n$ samples from $D$ with replacement, call it $S_i$ ($S_i$ might contain the same sample more than once)
    • train with set $S_i$
    • test on the remaining samples $(D - S_i)$
    • let $M_i$ be the test metric (e.g., accuracy, MSE)

Mean $\hat{\mu} = \frac{1}{B}\sum_{i=1}^{B} M_i$, variance $\hat{\sigma}^2 = \frac{1}{B}\sum_{i=1}^{B} (M_i - \hat{\mu})^2$

  • The probability of not picking one particular item in 1 draw is $1 - 1/n$
  • The probability of not picking one particular item in $n$ draws is $(1 - 1/n)^n$
  • The probability of picking one particular item in $n$ draws is $1 - (1 - 1/n)^n$

Finally: $\lim_{n\to\infty} 1-(1-1/n)^n = 1-1/e \approx 0.632$, so each bootstrap training set contains about 63.2% of the distinct samples.
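
A sketch of the bootstrap evaluation loop, under the same placeholder `train_fn` / `metric_fn` assumptions as the k-fold sketch above:

```python
import random

def bootstrap_eval(data, B, train_fn, metric_fn, seed=0):
    """Bootstrap evaluation: train on n samples drawn with replacement,
    test on the samples that were never drawn (the out-of-bag samples)."""
    rng = random.Random(seed)
    data = list(data)
    n = len(data)
    metrics = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]            # sample with replacement
        out_of_bag = [data[j] for j in range(n) if j not in set(idx)]
        if not out_of_bag:                                     # extremely unlikely for large n
            continue
        model = train_fn([data[j] for j in idx])
        metrics.append(metric_fn(model, out_of_bag))
    mean = sum(metrics) / len(metrics)
    var = sum((m - mean) ** 2 for m in metrics) / len(metrics)
    return mean, var
```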

Nested Cross-validation

  • Useful for hyperparameter tuning
  • Inner cross-validation (e.g., 0.632 bootstrapping) to try different hyperparameters
  • Outer cross-validation (e.g., k-fold) to report the metric using the best hyperparameter

Hypothesis testing

  • Let $\hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}_1^2, \hat{\sigma}_2^2$ be the mean and variance of algorithms 1 and 2

$$x = \frac{(\hat{\mu}_1-\hat{\mu}_2)\sqrt{n}}{\sqrt{\hat{\sigma}_1^2+\hat{\sigma}_2^2}} \qquad \nu = \left\lfloor\frac{(\hat{\sigma}_1^2+\hat{\sigma}_2^2)^2(n-1)}{\hat{\sigma}_1^4+\hat{\sigma}_2^4}\right\rfloor$$

$\nu$ (rounded down) gives the degrees of freedom of the Student's t-distribution against which $x$ is compared.
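
A small sketch computing $x$ and $\nu$ from per-fold metrics of two algorithms; it assumes sample variances with the $1/(n-1)$ normalisation, which may differ slightly from the lecture's convention, and the example numbers are made up. The resulting $x$ would then be compared against a Student's t critical value with $\nu$ degrees of freedom (e.g. via `scipy.stats.t`).

```python
import math

def compare_algorithms(metrics_1, metrics_2):
    """Return the test statistic x and degrees of freedom nu for two algorithms
    evaluated on the same n folds."""
    n = len(metrics_1)
    mu1, mu2 = sum(metrics_1) / n, sum(metrics_2) / n
    var1 = sum((m - mu1) ** 2 for m in metrics_1) / (n - 1)   # assumed normalisation
    var2 = sum((m - mu2) ** 2 for m in metrics_2) / (n - 1)
    x = (mu1 - mu2) * math.sqrt(n) / math.sqrt(var1 + var2)
    nu = math.floor((var1 + var2) ** 2 * (n - 1) / (var1 ** 2 + var2 ** 2))
    return x, nu

print(compare_algorithms([0.81, 0.79, 0.84, 0.80], [0.75, 0.78, 0.76, 0.77]))
```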


Learning with expert advice

In machine learning, the term "expert" typically refers to a model or algorithm with specialized knowledge or predictive ability for a specific task or domain.

Learning from expert advice

  • Learner listens to some / all experts making decisions
  • True outcomes are ADVERSARIAL
  • Learner updates weights over experts based on their losses
  • The algorithms are all forms of "multiplicative weights"
  • Nice clean bounds on total mistakes/losses via the "potential function" technique

Adversarial

  • The true outcomes do not come from a fixed probability distribution; they may instead be chosen by an adversary attempting to make the learner fail. This is a conservative assumption that ensures the algorithm performs well even in worst-case scenarios, so the algorithm must cope with the most unfavourable situations.

Potential function and clean upper bound

  • Using potential functions, we can derive explicit upper bounds on the algorithm's cumulative number of mistakes or total loss.

Infallible expert (one expert is always perfect)

  • Majority Vote Algorithm

Imperfect experts (none guaranteed perfect), with increasingly better algorithms:

  • Weighted Majority Vote Algorithm by Halving
  • Weighted Majority Voting by General Multiplicative Weights
  • Probabilistic Experts Algorithm

Infallible Expert Algorithm: Majority Vote

  1. Initialize the set $E$ of experts who haven't made any mistakes
  2. Repeat per round:
    • Observe predictions $E_i$ for all $i \in \{1, \ldots, n\}$
    • Make the majority prediction $\argmax_{y \in \{-1,1\}} \sum_{i \in E} 1[E_i = y]$
    • Observe the correct outcome
    • Remove mistaken experts from $E$

Mistake bound for Majority Vote: under the infallible expert assumption, majority vote makes at most $M \leq \log_2 n$ mistakes in total.
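
A minimal sketch of the majority-vote learner over ±1 predictions; the data layout (one list of expert predictions per round) is just an assumption for illustration.

```python
def majority_vote(expert_predictions, outcomes):
    """Infallible-expert majority vote; returns the learner's mistake count.

    expert_predictions[t][i] is expert i's +1/-1 prediction in round t,
    outcomes[t] is the true +1/-1 label.
    """
    alive = set(range(len(expert_predictions[0])))    # experts with no mistakes yet
    mistakes = 0
    for preds, y in zip(expert_predictions, outcomes):
        guess = 1 if sum(preds[i] for i in alive) >= 0 else -1   # majority prediction
        mistakes += guess != y
        alive = {i for i in alive if preds[i] == y}              # drop mistaken experts
    return mistakes
```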

Imperfect experts and the Halving Algorithm

No more guarantee of an infallible expert

A “zero tolerance” policy of dropping an expert on its first mistake no longer makes sense.

Imperfect experts: Halving Algorithm

  1. Initialize $w_i = 1$, the weight of expert $E_i$
  2. Repeat per round:
    • Observe predictions $E_i$ for all $i \in \{1, \ldots, n\}$
    • Make the weighted majority prediction $\argmax_{y \in \{-1,1\}} \sum_{i \in E} w_i 1[E_i = y]$
    • Observe the correct outcome
    • Downweigh each mistaken expert $E_i$: $w_i \leftarrow w_i / 2$

Mistake bound for Halving
If the best expert makes $m$ mistakes, then weighted majority vote makes $M \leq 2.4(m + \log_2 n)$ mistakes

From Halving to multiplying weights by $1-\epsilon$

  1. Initialize $w_i = 1$, the weight of expert $E_i$
  2. Repeat per round:
    • Observe predictions $E_i$ for all $i \in \{1, \ldots, n\}$
    • Make the weighted majority prediction $\argmax_{y \in \{-1,1\}} \sum_{i \in E} w_i 1[E_i = y]$
    • Observe the correct outcome
    • Downweigh each mistaken expert $E_i$: $w_i \leftarrow w_i (1-\epsilon)$

Mistake bound for $1-\epsilon$
If the best expert makes $m$ mistakes, then the $1-\epsilon$ weighted majority vote makes $M \leq 2(1+\epsilon)m + (2\log_e n)/\epsilon$ mistakes
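
Both Halving and the general $1-\epsilon$ update fit one sketch: multiply the weight of every mistaken expert by $1-\epsilon$, where $\epsilon = 1/2$ recovers the Halving algorithm. The data layout is the same illustrative one as in the majority-vote sketch above.

```python
def weighted_majority(expert_predictions, outcomes, eps=0.5):
    """Weighted majority vote with multiplicative (1 - eps) updates."""
    n = len(expert_predictions[0])
    w = [1.0] * n
    mistakes = 0
    for preds, y in zip(expert_predictions, outcomes):
        score = sum(w[i] * preds[i] for i in range(n))
        guess = 1 if score >= 0 else -1                 # weighted majority prediction
        mistakes += guess != y
        for i in range(n):                              # downweigh mistaken experts
            if preds[i] != y:
                w[i] *= 1 - eps
    return mistakes
```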

Mini Conclusion

With imperfect experts, the number of mistakes made by the weighted majority algorithm depends on the number of mistakes $m$ made by the best expert, and the factor of 2 in this dependence is unavoidable.

The probabilistic experts algorithm

  • Change 1: move from mistakes to losses $\ell_i^{(t)} \in [0,1]$ of expert $E_i$
  • Change 2: a randomised algorithm means bounding expected losses (it is a risk!)
  1. Initialize $w_i = 1$, the weight of expert $E_i$
  2. Repeat per round:
    • Observe predictions $E_i$ for all $i \in \{1, \ldots, n\}$
    • Follow the prediction $E_i$ of expert $i$ with probability $\frac{w_i}{W}$, where $W = \sum_{j=1}^{n} w_j$
    • Observe losses
    • Update each weight: $w_i \leftarrow (1-\epsilon)^{\ell_i^{(t)}} w_i$

Expected loss bound
The expected loss of the probabilistic experts algorithm is $L \leq \frac{\log n}{\epsilon} + (1 + \epsilon)L^*$, where $L^*$ is the minimum loss over experts
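
A sketch of the randomised version, tracking the learner's expected loss per round; the loss-matrix layout `expert_losses[t][i]` and the parameter names are assumptions for illustration.

```python
import random

def probabilistic_experts(expert_losses, eps=0.1, seed=0):
    """Probabilistic experts algorithm over losses in [0, 1].

    Each round the learner follows expert i with probability w_i / W, then
    every weight is updated multiplicatively by (1 - eps) ** loss_i.
    """
    rng = random.Random(seed)
    n = len(expert_losses[0])
    w = [1.0] * n
    expected_loss = 0.0
    for losses in expert_losses:
        W = sum(w)
        expected_loss += sum(w[i] / W * losses[i] for i in range(n))
        followed = rng.choices(range(n), weights=w)[0]   # the expert actually played
        w = [(1 - eps) ** losses[i] * w[i] for i in range(n)]
    return expected_loss
```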

MAB

Where we learn to take actions; we receive only indirect supervision in the form of rewards

Multi-Armed Bandit

  • Simplest setting for balancing exploration and exploitation
  • Same family of ML tasks as reinforcement learning

Stochastic MAB setting

Possible actions $\{1, \ldots, k\}$ are called "arms"

  • Arm $i$ has distribution $P_i$ on bounded rewards with mean $\mu_i$

In round $t = 1, \ldots, T$:

  • Play action $i_t \in \{1, \ldots, k\}$
  • Receive reward $R_{i_t}(t) \sim P_{i_t}$

Goal: minimize cumulative regret

  • $\mu^* T - \sum_{t=1}^{T} E[R_{i_t}(t)]$, where $\mu^* = \max_i \mu_i$

Greedy

At round t,

  • Estimate the value of each arm $i$ as the average reward observed

$$Q_{t-1}(i) = \begin{cases} \frac{\sum_{s=1}^{t-1} R_i(s)\, \mathbb{1}[i_s = i]}{\sum_{s=1}^{t-1} \mathbb{1}[i_s = i]}, & \text{if } \sum_{s=1}^{t-1} \mathbb{1}[i_s = i] > 0 \\ Q_0, & \text{otherwise} \end{cases}$$

Some initial constant $Q_0(i) = Q_0$ is used until arm $i$ has been pulled

Exploit! $i_t \in \argmax_{1 \leq i \leq k} Q_{t-1}(i)$
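
A sketch of the greedy choice from running per-arm statistics; `counts`/`sums` (number of pulls and total reward per arm) and `q0` are illustrative names for the quantities above.

```python
import random

def greedy_choice(counts, sums, q0):
    """Pick the arm with the highest estimated value Q_{t-1}(i)."""
    q = [sums[i] / counts[i] if counts[i] > 0 else q0 for i in range(len(counts))]
    best = max(q)
    return random.choice([i for i, v in enumerate(q) if v == best])   # break ties randomly
```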

$\epsilon$-Greedy

At round t

  • Estimate the value of each arm $i$ as the average reward observed (the Q-value function is the same as in greedy)

$$Q_{t-1}(i) = \begin{cases} \frac{\sum_{s=1}^{t-1} R_i(s)\, \mathbb{1}[i_s = i]}{\sum_{s=1}^{t-1} \mathbb{1}[i_s = i]}, & \text{if } \sum_{s=1}^{t-1} \mathbb{1}[i_s = i] > 0 \\ Q_0, & \text{otherwise} \end{cases}$$

Explore or exploit:

$$i_t \sim \begin{cases} \arg\max_{1 \leq i \leq k} Q_{t-1}(i), & \text{with probability } 1 - \epsilon \\ \text{Unif}(\{1, \ldots, k\}), & \text{with probability } \epsilon \end{cases}$$

$\epsilon$ controls exploration vs. exploitation

  • Larger $\epsilon$ increases the chance of exploration
  • Smaller $\epsilon$ favours exploitation
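
A self-contained sketch of the $\epsilon$-greedy choice; the Q estimate is the same running average used in the greedy sketch, and the argument names are illustrative.

```python
import random

def epsilon_greedy_choice(counts, sums, q0, eps):
    """Explore a uniform random arm with probability eps, otherwise exploit greedily."""
    k = len(counts)
    if random.random() < eps:
        return random.randrange(k)                                    # explore
    q = [sums[i] / counts[i] if counts[i] > 0 else q0 for i in range(k)]
    best = max(q)
    return random.choice([i for i, v in enumerate(q) if v == best])   # exploit
```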

Upper-Confidence Bound (UCB)

$$Q_{t-1}(i) = \begin{cases} \hat{\mu}_{t-1}(i) + \sqrt{\frac{2 \log t}{N_{t-1}(i)}}, & \text{if } \sum_{s=1}^{t-1} \mathbb{1}[i_s = i] > 0 \\ Q_0, & \text{otherwise} \end{cases}$$

Some initial constant $Q_0(i) = Q_0$ is used until arm $i$ has been pulled

$$N_{t-1}(i) = \sum_{s=1}^{t-1} \mathbb{1}[i_s = i] \qquad \hat{\mu}_{t-1}(i) = \frac{\sum_{s=1}^{t-1} R_i(s)\, \mathbb{1}[i_s = i]}{\sum_{s=1}^{t-1} \mathbb{1}[i_s = i]}$$
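
A sketch of the UCB choice plus a toy simulation loop; the two Bernoulli arm means (0.4 and 0.6) are made-up illustrative values.

```python
import math
import random

def ucb_choice(counts, sums, t, q0):
    """Empirical mean plus the exploration bonus sqrt(2 log t / N_{t-1}(i))."""
    q = [sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
         if counts[i] > 0 else q0
         for i in range(len(counts))]
    best = max(q)
    return random.choice([i for i, v in enumerate(q) if v == best])

# toy usage: two Bernoulli arms with (made-up) means 0.4 and 0.6
rng = random.Random(0)
counts, sums = [0, 0], [0.0, 0.0]
for t in range(1, 1001):
    arm = ucb_choice(counts, sums, t, q0=1.0)
    reward = 1.0 if rng.random() < (0.4, 0.6)[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward
print(counts)   # the better arm (index 1) should be pulled far more often
```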
