Support Vector Machine

An SVM is a linear binary classifier: choosing parameters means choosing a separating boundary (hyperplane).

SVMs aim to find the separation boundary that maximises the margin between the classes.

In an SVM, the hyperplane is the decision boundary that separates the two classes.

  • Hyperplane: a hyperplane is a flat hypersurface, a subspace whose dimension is one less than that of the ambient space.

The hyperplane of the decision boundary in $\mathbb{R}^m$ is defined as

$$\mathbf{w}^T \mathbf{x} + b = 0$$

  • $\mathbf{w}$: a normal vector perpendicular to the hyperplane
  • $b$: the bias term

The distance from the $i$-th point to a perfect boundary is

$$\|r_i\| = \frac{y_i(\mathbf{w}'\mathbf{x}_i + b)}{\|\mathbf{w}\|}$$

  • The margin width is the distance to the closest point

Hard-margin SVM objective

SVMs aim to find:

$$\arg\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2$$

  • the factor of $\frac{1}{2}$ is for mathematical convenience

Subject to constraints:

$$y_i(\mathbf{w}'\mathbf{x}_i + b) \geq 1 \quad \text{for } i = 1, \ldots, n$$
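
As a concrete illustration of this primal program, here is a minimal sketch that solves it numerically with SciPy; the toy data, the SLSQP settings, and all names are illustrative choices, not something from the notes.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data; labels must be +1 / -1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
n, m = X.shape

def objective(theta):
    w = theta[:m]                      # theta packs [w, b]
    return 0.5 * w @ w                 # (1/2) ||w||^2

# y_i (w'x_i + b) - 1 >= 0 for every training point
constraints = [
    {"type": "ineq", "fun": lambda theta, i=i: y[i] * (X[i] @ theta[:m] + theta[m]) - 1}
    for i in range(n)
]

res = minimize(objective, x0=np.zeros(m + 1), method="SLSQP", constraints=constraints)
w, b = res.x[:m], res.x[m]
print("w =", w, "b =", b)
print("margins:", y * (X @ w + b))     # all approximately >= 1 at the optimum
```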

Soft-margin SVM

  • Soft-margin SVM objective:

$$\arg\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$

  • $\xi_i$: slack variable
  • $C$: hyperparameter controlling the trade-off between margin width and total slack

Subject to constraints:

$$\xi_i \geq 1 - y_i(\mathbf{w}'\mathbf{x}_i + b) \quad \text{for } i = 1, \ldots, n$$
$$\xi_i \geq 0 \quad \text{for } i = 1, \ldots, n$$
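
For comparison, a minimal sketch of fitting a soft-margin linear SVM with scikit-learn; the toy data and the value of C are illustrative assumptions (a very large C would approach the hard-margin case).

```python
import numpy as np
from sklearn.svm import SVC

# Toy data with one overlapping point so the slack variables matter.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.4, 0.3], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, 1, -1, -1])

# Small C -> wide margin, more slack allowed; large C -> approaches hard margin.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("support vectors:\n", clf.support_vectors_)
```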

Kernel Methods

The primal problem of the SVM is a convex optimization problem. We can use the Lagrangian and duality to solve it more easily.

Lagrange multipliers:

  • Transform to unconstrained optimisation
  • Transform the primal program into a related dual program, an alternative to the primal
  • Analyse necessary & sufficient conditions for solutions of both programs

primal program: $\min_x \max_{\lambda \geq 0, \nu} \mathcal{L}(x, \lambda, \nu)$
dual program: $\max_{\lambda \geq 0, \nu} \min_x \mathcal{L}(x, \lambda, \nu)$ (easier to compute!)
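
For reference, these programs use the Lagrangian of a generic constrained problem $\min_x f(x)$ subject to $g_i(x) \leq 0$ and $h_j(x) = 0$; the $g_i$ and $h_j$ in the KKT conditions below are meant in this sense:

$$\mathcal{L}(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x)$$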

Karush-Kuhn-Tucker Necessary Conditions

  • Primal feasibility: $g_i(x^*) \leq 0$ and $h_j(x^*) = 0$. The solution $x^*$ must satisfy the primal constraints.
  • Dual feasibility: $\lambda_i^* \geq 0$ for $i = 1, \ldots, n$. The Lagrange multipliers $\lambda_i^*$ must be non-negative.
  • Complementary slackness: $\lambda_i^* g_i(x^*) = 0$ for $i = 1, \ldots, n$. For any inequality constraint, the product of the Lagrange multiplier and the constraint function must be zero.
  • Stationarity: $\nabla_x \mathcal{L}(x^*, \lambda^*, \nu^*) = 0$. The gradient of the Lagrangian with respect to $x$ must be zero at the optimal solution.

Hard-margin SVM with the Lagrangian and duality

$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\lambda}) = \frac{1}{2}\mathbf{w}'\mathbf{w} + \sum_{i=1}^n \lambda_i - \sum_{i=1}^n \lambda_i y_i \mathbf{x}_i'\mathbf{w} - \sum_{i=1}^n \lambda_i y_i b$$

Applying the stationarity condition from the KKT conditions:

$$\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^n \lambda_i y_i = 0 \;\rightarrow\; \text{new constraint, eliminates primal variable } b$$
$$\nabla_{\mathbf{w}}\mathcal{L} = \mathbf{w}^* - \sum_{i=1}^n \lambda_i y_i \mathbf{x}_i = 0 \;\rightarrow\; \text{eliminates primal variable: } \mathbf{w}^* = \sum_{i=1}^n \lambda_i y_i \mathbf{x}_i$$

Substituting these back makes the problem much easier! Here it is:

Dual program for hard-margin SVM

$$\underset{\boldsymbol{\lambda}}{\text{argmax}} \sum_{i=1}^n \lambda_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \lambda_i\lambda_j y_i y_j \mathbf{x}_i'\mathbf{x}_j$$
$$\text{s.t. } \lambda_i \geq 0 \text{ and } \sum_{i=1}^n \lambda_i y_i = 0$$
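
Here is a minimal numerical sketch of this dual program using SciPy's SLSQP; the recovery of $\mathbf{w}^*$ and $b^*$ follows the formulas above, but the toy data, the support-vector tolerance, and all names are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data, labels in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)
Q = (y[:, None] * X) @ (y[:, None] * X).T   # Q_ij = y_i y_j x_i' x_j

def neg_dual(lam):
    # Minimize the negative of the dual objective = maximize the dual.
    return 0.5 * lam @ Q @ lam - lam.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(n),
    method="SLSQP",
    bounds=[(0.0, None)] * n,                                  # lambda_i >= 0
    constraints=[{"type": "eq", "fun": lambda lam: lam @ y}],  # sum_i lambda_i y_i = 0
)
lam = res.x
w = (lam * y) @ X                 # w* = sum_i lambda_i y_i x_i
sv = lam > 1e-6                   # support vectors have lambda_i > 0
b = np.mean(y[sv] - X[sv] @ w)    # from y_i (w'x_i + b) = 1 on the support vectors
print("lambda =", lam.round(4), "\nw =", w, " b =", b)
```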

Making predictions with dual solution

$$s = b^* + \sum_{i=1}^{n} \lambda_i^* y_i \mathbf{x}_i'\mathbf{x}$$

For the Lagrangian problem: if $\lambda_i = 0$, the sample $\mathbf{x}_i$ does not impact the decision boundary.
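
As a small illustration of this prediction rule, the sketch below computes $s$ directly from a dual solution; the arrays `lam`, `y`, `X` and the value `b` are illustrative (they happen to be the dual solution of this two-point toy set), not values from the notes.

```python
import numpy as np

def decision_score(x_new, lam, y, X, b):
    # s = b* + sum_i lambda_i* y_i x_i' x ; points with lambda_i = 0 drop out of the sum
    return b + np.sum(lam * y * (X @ x_new))

# Illustrative two-point training set and its hard-margin dual solution.
X = np.array([[2.0, 2.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
lam = np.array([1/9, 1/9])          # lambda* (both points are support vectors)
b = -1/3                            # b* recovered from y_i (w'x_i + b) = 1

s = decision_score(np.array([1.0, 1.5]), lam, y, X, b)
print(s, "-> class", 1 if s >= 0 else -1)   # s = 0.5, so class +1
```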

For the slack variables $\xi_i$ in the soft-margin SVM:

  • $\xi_i < 0$: theoretically impossible, as $\xi_i$ represents the deviation from the margin and is defined to be non-negative, i.e., $\xi_i \geq 0$.
  • $\xi_i = 0$: the sample is correctly classified and lies either outside the margin or exactly on the margin boundary.
  • $0 < \xi_i < 1$: the sample lies within the margin but is still correctly classified.
  • $1 < \xi_i < 2$: the sample is misclassified, but its distance from the hyperplane does not exceed the margin on the opposite class's side.
  • $\xi_i \geq 2$: the sample is severely misclassified and lies more than one margin width beyond the hyperplane.

Soft-margin SVM’s dual

The dual for the soft-margin SVM is similar; the difference is the constraint on $\lambda$:

$$\underset{\boldsymbol{\lambda}}{\text{argmax}} \sum_{i=1}^n \lambda_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \lambda_i\lambda_j y_i y_j \mathbf{x}_i'\mathbf{x}_j$$
$$\text{s.t. } 0 \leq \lambda_i \leq C \text{ and } \sum_{i=1}^n \lambda_i y_i = 0$$
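
In the dual-solver sketch above, only the bounds on $\lambda$ would change for the soft margin; a minimal illustration (the value of C is an arbitrary assumption):

```python
n, C = 4, 1.0                        # n training points, assumed value of C
bounds = [(0.0, C)] * n              # soft margin: 0 <= lambda_i <= C
# the hard-margin version used: bounds = [(0.0, None)] * n
```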

Kernelising the SVM

A strategy for feature space transformation:

  • Map data into a new feature space
  • Run hard-margin or soft-margin SVM in new space
  • Decision boundary is non-linear in original space

$\mathbf{x}_i'\mathbf{x}_j \rightarrow \phi(\mathbf{x}_i)'\phi(\mathbf{x}_j)$

Polynomial kernel

$$k(u, v) = (u'v + c)^d$$

Radial basis function kernel

$$k(u, v) = \exp(-\gamma \|u - v\|^2)$$
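
A minimal sketch of these two kernels in NumPy, plus how they might be selected in scikit-learn's SVC; the values of c, d, and gamma and the toy data are illustrative (note that scikit-learn's polynomial kernel is $(\gamma \, u'v + \text{coef0})^{\text{degree}}$, so gamma=1 matches the form above).

```python
import numpy as np
from sklearn.svm import SVC

def polynomial_kernel(u, v, c=1.0, d=2):
    # k(u, v) = (u'v + c)^d
    return (u @ v + c) ** d

def rbf_kernel(u, v, gamma=0.5):
    # k(u, v) = exp(-gamma * ||u - v||^2)
    return np.exp(-gamma * np.sum((u - v) ** 2))

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(u, v), rbf_kernel(u, v))

# The same kernels inside an SVM (XOR-like toy data, not linearly separable):
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])
poly_svm = SVC(kernel="poly", degree=2, coef0=1.0, gamma=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=0.5).fit(X, y)
print(poly_svm.predict(X), rbf_svm.predict(X))
```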

Kernel trick

In machine learning, the kernel trick allows us to compute the inner product in a high-dimensional feature space without explicitly mapping data points to that space.

For two data points $x_i$ and $x_j$:

$$x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \end{bmatrix}, \quad x_j = \begin{bmatrix} x_{j1} \\ x_{j2} \end{bmatrix}$$

The polynomial kernel function defines an inner product in the original space as:

$$K(x_i, x_j) = (x_i \cdot x_j + 1)^2$$

Expanding the polynomial:

$$K(x_i, x_j) = (x_{i1} x_{j1} + x_{i2} x_{j2} + 1)^2 = x_{i1}^2 x_{j1}^2 + x_{i2}^2 x_{j2}^2 + 2x_{i1} x_{j1} x_{i2} x_{j2} + 2x_{i1} x_{j1} + 2x_{i2} x_{j2} + 1$$

This result is equivalent to the inner product in a higher-dimensional space without explicitly computing the high-dimensional mapping.
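
A quick numerical check of this equivalence; the explicit map $\phi(x) = (x_1^2,\, x_2^2,\, \sqrt{2}x_1x_2,\, \sqrt{2}x_1,\, \sqrt{2}x_2,\, 1)$ used below is one standard choice that reproduces the expansion above.

```python
import numpy as np

def phi(x):
    # Explicit feature map whose inner product reproduces (x_i . x_j + 1)^2
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

def poly_kernel(xi, xj):
    # Kernel trick: same value without ever forming the 6-dimensional vectors
    return (xi @ xj + 1.0) ** 2

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(xi) @ phi(xj))    # 4.0
print(poly_kernel(xi, xj))  # 4.0 -> identical
```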
