# ML Series6: Support Vector Machine

A Theoretically Beautiful Algorithm 🎆

Pre 1980s, almost all learning methods learned linear decision surfaces, and they have nice theoretical properties. In 1980s, decision trees and neural networks have allowed efficient learning of non-linear decision surfaces, but has little theoretical basis and suffer from local minima. When it comes to 1990s, Support Vector Machine(SVM) was developed, which has nice theoretical properties and can avoid local minima.

In this post, I will discuss the form of its corresponding **convex optimization problem**, and then how to use Lagrange Multiplier to solve its **dual problem**. Additionally, it will introduce **kernel trick**. Please note at this moment, I’m only going to show the naive version(hard-margin). I probably will add the soft margin version in the future.

# Overview

SVM is a classic binary classifier. Given training set D = {(x1, x2), (x2, y2)…(xm, ym)}, yi ∈ {-1, 1}, we want to find a hyperplane in sample space to separate different classes. SVM decides the best line is the one with maximum margin. Every sample should fall outside of the margin. The larger the margin, the more confident we are that unseen samples will be correctly classified.

In sample space, we define the hyperplane divider as **w’x + b = 0**, where w is the normal vector. Pick any point x’ in the sample space, the distance from x’ to the hyperplane is

## Proof

Pick any point on the hyperplane x, the distance from x’ to the hyperplane is equal to the projection of (x’-x) on the normal vector w.

We assume that the hyperplane can **correctly classify** the samples, then if w’x +b ≥ 1, label should be positive; if w’x + b ≤ -1, the label should be negative. To make these two inequalities cleaner, we will introduce a variable y such that y = +1 for positive samples, y = -1 for negative samples. Thus we can make our inequalities to y(w’x +b) ≥1. This is our assumption.

**Support vectors** are defined as the samples which are closest to the hyperplane. Therefore, we scale to make the samples closest to hyperplane satisfy w’x +b = 1 and w’x + b = -1. (as shown left)

After plugging in distance equation we derived, the distance between two different class support vectors are 2/||w||. To find the hyperplane with the **maximum margin** is to find w and b that satisfy

We can rewrite the maximization to a minimization problem.

And this is the basic form of Support Vector Machine, a quadratic programming function.

# Optimization — Duality Problem

- Duality problem is easier to optimize
- We can get the form of inner product, where we can apply
**kernel method**

Because of above reasons, we want to transform the problem to a dual problem. We can do it by applying Lagrange Multiplier

Plug in the w = Σ αyx to original hyperplane formula, we get

Because of KKT condition, we need to satisfy

We can notice that for non-support vectors, their Lagrange Multiplier must be 0. So to get f(x), we only need to compute the inner product between new samples and support vectors.

# Kernel Tricks

Inner product has other great strength here!

Graphically,

- Two very similar xi, xj vectors that predict different classes tend to maximize the margin width

2. Two vectors that are similar but predict the same class are redundant

3. Two dissimilar (orthogonal) vectors don’t count at all

We can use Kernel Tricks to increase the feature space, without explicit visit to higher dimensional feature space. This is another powerful benefits brought by the dual form: we don’t need any additional heavy computation during training and prediction, and we can increase the feature space for free!

For sample space that is not linearly separable like below, we can use kernel trick to to create nonlinear classifiers.

Original Dual problem becomes

Classification function becomes

Some common kernels include

# Interview Questions

- Explain SVM to a non-technical person.
- Can you explain SVM？
- What is the geometric intuition behind SVM?
- Explain the Dual form of SVM formulation?
- Explain what is hinge loss?
- What’s the “kernel trick” and how is it useful?
- Should you use the primal or the dual form of the SVM problem to train a model on a training set with millions of instances and hundreds of features?
- Explain about SVM Regression?
- Give some situations where you will use an SVM over a RandomForest Machine Learning algorithm.
- Can we apply the kernel trick to logistic regression? Why is it not used in practice then?

**Thanks for reading the article! Hope this is helpful. Please let me know if you need more information.**