
Support Vector Machine

An SVM is a linear binary classifier: choosing parameters means choosing a separating boundary (hyperplane).

SVMs aim to find the separation boundary that maximises the margin between the classes.

A hyperplane is the decision boundary that separates the two classes in an SVM.

  • Hyperplane: a hyperplane is a flat hypersurface, a subspace whose dimension is one less than that of the ambient space.

The hyperplane decision boundary in $\mathbb{R}^m$ is defined as

$$\mathbf{w}^T \mathbf{x} + b = 0$$
  • $\mathbf{w}$: a normal vector perpendicular to the hyperplane
  • $b$: the bias term

The distance from the $i$-th point to a perfect boundary is

$$\|r_i\| = \frac{y_i(\mathbf{w}' \mathbf{x}_i + b)}{\|\mathbf{w}\|}$$
  • The margin width is the distance to the closest point
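As a quick numerical illustration (not from the original notes), here is a minimal numpy sketch that evaluates this distance for a couple of points; the values of $\mathbf{w}$, $b$, and the data are assumed purely for illustration:

```python
import numpy as np

# Illustrative (assumed) hyperplane parameters and labelled points
w = np.array([2.0, 1.0])          # normal vector of the hyperplane
b = -3.0                          # bias term
X = np.array([[3.0, 1.0],         # points x_i
              [0.5, 0.5]])
y = np.array([1, -1])             # class labels y_i in {-1, +1}

# Signed distance of each point to the hyperplane: r_i = y_i (w'x_i + b) / ||w||
r = y * (X @ w + b) / np.linalg.norm(w)

# The margin is the distance to the closest point
margin = r.min()
print(r, margin)
```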

Hard-margin SVM objective

SVMs aim to find:

$$\arg \min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2$$
  • the factor of $\frac{1}{2}$ is for mathematical convenience

Subject to constraints:

$$y_i(\mathbf{w}' \mathbf{x}_i + b) \geq 1 \text{ for } i = 1, \ldots, n$$

Soft-margin SVM

  • Soft-margin SVM objective:
$$\arg\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$
  • $\xi_i$: slack variable

Subject to constraints:

$$\xi_i \geq 1 - y_i (\mathbf{w}'\mathbf{x}_i + b) \text{ for } i = 1, \ldots, n \\[6pt] \xi_i \geq 0 \text{ for } i = 1, \ldots, n$$
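To make the slack variables concrete, here is a small numpy sketch (illustrative only) that computes $\xi_i = \max(0,\ 1 - y_i(\mathbf{w}'\mathbf{x}_i + b))$ for an assumed boundary and a few assumed points:

```python
import numpy as np

# Illustrative (assumed) boundary and points; labels y_i are in {-1, +1}
w, b = np.array([1.0, 1.0]), -1.0
X = np.array([[1.5, 1.5],   # well outside the margin (xi = 0)
              [0.8, 0.5],   # inside the margin but correctly classified (0 < xi < 1)
              [1.2, 0.5]])  # on the wrong side of the boundary (xi > 1)
y = np.array([1, 1, -1])

# Slack: xi_i = max(0, 1 - y_i (w'x_i + b)); zero for points outside the margin
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)
```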


Kernel Methods

The primal problem of the SVM is a convex optimisation problem. We can use the Lagrangian and duality to solve it more easily.

Lagrangian multipliers:

  • Transform to unconstrained optimisation
  • Transform the primal program into a related dual program, an alternative to the primal
  • Analyse necessary & sufficient conditions for solutions of both programs

Primal program: $\min_x \max_{\lambda \geq 0, \nu} \mathcal{L}(x, \lambda, \nu)$
Dual program: $\max_{\lambda \geq 0, \nu} \min_x \mathcal{L}(x, \lambda, \nu)$ (easier to compute!)

Karush-Kuhn-Tucker Necessary Conditions

  • Primal feasibility: $g_i(x^*) \leq 0$ and $h_j(x^*) = 0$. The solution $x^*$ must satisfy the primal constraints.
  • Dual feasibility: $\lambda_i^* \geq 0$ for $i = 1, \ldots, n$. Each Lagrange multiplier $\lambda_i^*$ must be non-negative.
  • Complementary slackness: $\lambda_i^* g_i(x^*) = 0$ for $i = 1, \ldots, n$. For each inequality constraint, the product of the Lagrange multiplier and the constraint function must be zero.
  • Stationarity: $\nabla_x \mathcal{L}(x^*, \lambda^*, \nu^*) = 0$. The gradient of the Lagrangian with respect to $x$ must be zero at the optimal solution.
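As a tiny worked example (added for illustration, not part of the original notes), consider minimising $x^2$ subject to $x \geq 1$:

$$\min_x x^2 \quad \text{s.t. } 1 - x \leq 0, \qquad \mathcal{L}(x, \lambda) = x^2 + \lambda(1 - x)$$

Stationarity gives $2x^* - \lambda^* = 0$ and complementary slackness gives $\lambda^*(1 - x^*) = 0$. Taking $\lambda^* = 0$ would force $x^* = 0$, which violates primal feasibility, so the constraint is active: $x^* = 1$ and $\lambda^* = 2 \geq 0$, satisfying all four conditions.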

Hard-margin SVM with the Lagrangian and duality

$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\lambda}) = \frac{1}{2}\mathbf{w}'\mathbf{w} + \sum_{i=1}^n \lambda_i - \sum_{i=1}^n \lambda_i y_i \mathbf{x}_i'\mathbf{w} - \sum_{i=1}^n \lambda_i y_i b$$

Applying the stationarity condition from the KKT conditions:

$$\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^n \lambda_i y_i = 0 \rightarrow \text{new constraint; eliminates primal variable } b \\[6pt] \nabla_{\mathbf{w}}\mathcal{L} = \mathbf{w}^* - \sum_{i=1}^n \lambda_i y_i \mathbf{x}_i = 0 \rightarrow \text{eliminates primal variable: } \mathbf{w}^* = \sum_{i=1}^n \lambda_i y_i \mathbf{x}_i$$

This makes the problem much easier. The result is:

Dual program for hard-margin SVM

$$\underset{\boldsymbol{\lambda}}{\text{argmax}} \sum_{i=1}^n \lambda_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \lambda_i\lambda_j y_i y_j \mathbf{x}_i'\mathbf{x}_j \\[6pt] \text{s.t. } \lambda_i \geq 0 \text{ and } \sum_{i=1}^n \lambda_i y_i = 0$$
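One way to see this dual in action is to solve it directly as a small quadratic program. The sketch below uses cvxpy (an assumed third-party solver library, not mentioned in the notes) on a tiny, assumed linearly separable toy set, then recovers $\mathbf{w}^*$ and $b^*$ from the multipliers:

```python
import numpy as np
import cvxpy as cp

# Tiny linearly separable toy set (assumed for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Matrix G_ij = y_i y_j x_i' x_j; a tiny ridge keeps it numerically PSD
Yx = y[:, None] * X
G = Yx @ Yx.T + 1e-8 * np.eye(n)

lam = cp.Variable(n)
objective = cp.Maximize(cp.sum(lam) - 0.5 * cp.quad_form(lam, G))
constraints = [lam >= 0, y @ lam == 0]
cp.Problem(objective, constraints).solve()

# Recover the primal solution: w* = sum_i lambda_i y_i x_i,
# and b* from a support vector (lambda_i > 0): y_i (w'x_i + b) = 1
lam_star = lam.value
w = Yx.T @ lam_star
sv = np.argmax(lam_star)          # index of one support vector
b = y[sv] - X[sv] @ w
print(w, b, lam_star)
```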

Making predictions with the dual solution

$$s = b^* + \sum_{i=1}^{n} \lambda_{i}^{*} y_i \mathbf{x}_i'\mathbf{x}$$

For the Lagrangian problem: if $\lambda_i = 0$, the sample $\mathbf{x}_i$ does not impact the decision boundary (it is not a support vector).
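The same structure can be inspected with scikit-learn (assumed available, not part of the notes): SVC stores $\lambda_i^* y_i$ for the support vectors in dual_coef_, so the score $s$ can be recomputed by hand and compared with decision_function. A very large C approximates the hard margin; the data are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (assumed); a large C approximates the hard-margin SVM
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

sv = clf.support_vectors_                 # x_i with lambda_i > 0
alpha_y = clf.dual_coef_.ravel()          # lambda_i* y_i for the support vectors
b = clf.intercept_[0]                     # b*

x_new = np.array([1.5, 1.0])
s = b + np.sum(alpha_y * (sv @ x_new))    # s = b* + sum_i lambda_i* y_i x_i' x
print(s, clf.decision_function([x_new]))  # the two scores should agree
```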

For the Lagrangian problem in the soft-margin SVM, the slack variable $\xi_i$ satisfies:

  • $\xi_i < 0$: theoretically impossible, as $\xi_i$ represents the deviation from the margin and is defined to be non-negative, i.e., $\xi_i \geq 0$
  • $\xi_i = 0$: the sample is correctly classified and lies either outside the margin or exactly on the margin boundary.
  • $0 < \xi_i < 1$: the sample lies within the margin but is still correctly classified.
  • $1 < \xi_i < 2$: the sample is misclassified, but its distance to the hyperplane does not exceed the margin width on the wrong side.
  • $\xi_i \geq 2$: the sample is severely misclassified and lies more than one margin width beyond the hyperplane on the wrong side.

Soft-margin SVM’s dual

The dual for the soft-margin SVM is similar. The difference is the constraint on $\lambda_i$:

$$\underset{\boldsymbol{\lambda}}{\text{argmax}} \sum_{i=1}^n \lambda_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \lambda_i\lambda_j y_i y_j \mathbf{x}_i'\mathbf{x}_j \\[6pt] \text{s.t. } 0 \leq \lambda_i \leq C \text{ and } \sum_{i=1}^n \lambda_i y_i = 0$$
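Reusing the names from the cvxpy dual sketch in the hard-margin section (again an assumption, not part of the notes), the only change needed for the soft margin is the box constraint on the multipliers:

```python
# Soft-margin dual: identical to the hard-margin sketch above except for the
# box constraint 0 <= lambda_i <= C; C is an assumed, illustrative value.
C = 1.0
constraints = [lam >= 0, lam <= C, y @ lam == 0]
cp.Problem(objective, constraints).solve()
```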

Kernelising the SVM

A strategy for feature-space transformation:

  • Map data into a new feature space
  • Run hard-margin or soft-margin SVM in new space
  • Decision boundary is non-linear in original space

$$\mathbf{x}_i' \mathbf{x}_j \rightarrow \phi(\mathbf{x}_i)' \phi(\mathbf{x}_j)$$
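In practice the mapping is rarely applied explicitly. A scikit-learn sketch (assumed toy data, not from the notes) shows how a kernelised soft-margin SVM produces a boundary that is non-linear in the original space, using XOR-style data that no linear boundary can separate:

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style data: not linearly separable in the original space (assumed toy set)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# A linear SVM cannot fit this; an RBF-kernel SVM can
linear = SVC(kernel="linear", C=10.0).fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)

print(linear.score(X, y))  # at most 3 of the 4 XOR points can be correct
print(rbf.score(X, y))     # typically 1.0: the boundary is non-linear in the original space
```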

Polynomial kernel

$$k(\mathbf{u},\mathbf{v}) = (\mathbf{u}'\mathbf{v} + c)^d$$

Radial basis function kernel

$$k(\mathbf{u},\mathbf{v}) = \exp(-\gamma \|\mathbf{u}-\mathbf{v}\|^2)$$
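Both kernels are easy to implement directly. The sketch below (with assumed hyperparameter values) evaluates them in numpy and cross-checks against scikit-learn's pairwise kernels, which is an assumption about available tooling rather than part of the notes:

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

def poly_kernel(u, v, c=1.0, d=3):
    """Polynomial kernel k(u, v) = (u'v + c)^d."""
    return (u @ v + c) ** d

def rbf(u, v, gamma=0.5):
    """RBF kernel k(u, v) = exp(-gamma * ||u - v||^2)."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(u, v), polynomial_kernel([u], [v], degree=3, gamma=1, coef0=1)[0, 0])
print(rbf(u, v), rbf_kernel([u], [v], gamma=0.5)[0, 0])
```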

Kernel trick

In machine learning, the kernel trick allows us to compute the inner product in a high-dimensional feature space without explicitly mapping data points to that space.

For two data points $x_i$ and $x_j$:

$$x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \end{bmatrix}, \quad x_j = \begin{bmatrix} x_{j1} \\ x_{j2} \end{bmatrix}$$

The polynomial kernel function defines an inner product in the original space as:

$$K(x_i, x_j) = (x_i \cdot x_j + 1)^2$$

Expanding the polynomial:

$$K(x_i, x_j) = (x_{i1} x_{j1} + x_{i2} x_{j2} + 1)^2 = x_{i1}^2 x_{j1}^2 + x_{i2}^2 x_{j2}^2 + 2x_{i1} x_{j1} x_{i2} x_{j2} + 2x_{i1} x_{j1} + 2x_{i2} x_{j2} + 1$$

This result is equivalent to the inner product in a higher-dimensional space without explicitly computing the high-dimensional mapping.
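A quick numpy check (illustrative; the explicit map $\phi$ used here is one standard choice, not given in the notes) confirms that the expanded kernel equals the inner product under a 6-dimensional feature map:

```python
import numpy as np

def phi(x):
    """One explicit feature map whose inner product reproduces (x_i . x_j + 1)^2."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])

k_trick = (xi @ xj + 1.0) ** 2        # kernel trick: stay in the 2-D space
k_explicit = phi(xi) @ phi(xj)        # explicit 6-D mapping, then inner product
print(k_trick, k_explicit)            # both equal (here: 4.0)
```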