The Math behind Linear SVC Classifier

This article is reposted from my Kaggle account. Here is the original link:

https://www.kaggle.com/xingewang/the-math-behind-linear-svc-classifier

Some of the LaTeX formatting may have been altered in the repost below.

Linear Support Vector Classifier (Binary Case)

The reason why I wrote this article: the number of explanations of SVC/SVM on the internet is overwhelming, but I could not find one that clarified all the doubts I had about the math behind the Linear Support Vector Classifier. This article is for people who have the same doubts as me; I hope it will be helpful.

Background knowledge required: the geometric interpretation of the vector dot product (projecting one vector onto another); Lagrange multipliers (for optimization).

Notation: $X$ is your sample dataset; it contains $N$ samples and $P$ features. $X_{ij}$ is the $j$-th feature of the $i$-th sample. $X_i$ is the vector describing all $P$ features of sample $i$. $X_+$ is a sample from class "+", and $X_-$ is a sample from class "-". $Y$ is the label vector; $Y_i$ takes 2 values, $+1$ and $-1$.
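To make this notation concrete, here is a small NumPy sketch of such a dataset; the numbers are made up purely for illustration.

import numpy as np

# A toy dataset with N = 6 samples and P = 2 features (made-up values).
X = np.array([[1.0, 2.0],   # X_1: the first sample, with features X_11 and X_12
              [2.0, 3.0],
              [3.0, 3.0],
              [6.0, 5.0],
              [7.0, 8.0],
              [8.0, 6.0]])
Y = np.array([-1, -1, -1, +1, +1, +1])   # labels Y_i in {-1, +1}

N, P = X.shape   # N = 6, P = 2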

 

Logic behind Linear SVC: 

Samples are data points located in a P-dimensional space, where each axis represents one feature. The idea of the Support Vector Classifier is to find the "hyperplane" that separates the samples into 2 classes in this P-dimensional space. In a 2-D space, when we only have 2 features, SVC looks for the straight line that can separate the samples.

Of course, different people will draw different straight lines. However, no matter how you draw the line between these 2 classes, you will always find a closest sample from each class to your straight line. Those are the SUPPORT VECTORS. And the distance from these closest samples to your straight line describes how well you separate the two classes; that is the MARGIN. (Since there are already so many tutorials on SVM, all the figures in this article are from the web.)

So, how do we draw this line?
First, we need to make the straight line "unbiased" toward either class, so the distance from the "+" support vector to the line should equal the distance from the "-" support vector.
Second, if the margin is small, the line will be too sensitive to those support vectors: if the dataset changes, the support vectors will vary and your classifier will not be robust. In the figure below, all the lines are biased or "support-vector sensitive".

The goal of the Linear SVC is to MAXIMIZE the width of the margin. 
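As a quick sketch of this goal in code (reusing the made-up X and Y from the notation example), scikit-learn's svm.SVC with a linear kernel and a very large C behaves essentially like the hard-margin classifier described here, and it exposes the support vectors and the hyperplane it found:

from sklearn import svm

# A very large C leaves almost no slack, approximating the hard-margin case.
clf = svm.SVC(kernel="linear", C=1e6)
clf.fit(X, Y)

print(clf.support_vectors_)        # the closest samples from each class
print(clf.coef_, clf.intercept_)   # the w vector and b of the hyperplane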

Let's describe the problem in a mathematical way first.

The straight line in the simple case above can be expressed as:
$\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$.
To make it linear-algebra friendly, I'd like to think of it as
$\beta_1 x_1 + \beta_2 x_2 = -\beta_0$, that is, the dot product of 2 vectors: $\vec{w} = [\beta_1, \beta_2]$ and $\vec{x} = [x_1, x_2]$.
$\vec{w}$ is the vector that is perpendicular to the straight line, and $\vec{x}$ is a point on the straight line.
And of course, here is a more general way to quantify the "hyperplane" we are interested in finding:
$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p = 0$,
or $\vec{w} \cdot \vec{x} + b = 0$.

Once we have the "hyperplane", we can define our DECISION RULE.
We'd like the "+" and "-" class samples to lie on different sides of the "hyperplane".
For the support vectors, this is described mathematically as $\vec{w} \cdot \vec{X}_+ + b = +1$ and $\vec{w} \cdot \vec{X}_- + b = -1$.
For the other samples, $\vec{w} \cdot \vec{X}_+ + b > +1$ and $\vec{w} \cdot \vec{X}_- + b < -1$.
Using the label vector $Y$, these two conditions can be combined into one for mathematical convenience: $Y_i(\vec{w} \cdot \vec{X}_i + b) \ge 1$.
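A small NumPy check of this decision rule, assuming $\vec{w}$ and $b$ come from a fitted linear classifier such as the hypothetical clf above (clf.coef_[0] and clf.intercept_[0]):

import numpy as np

w = clf.coef_[0]          # the weight vector perpendicular to the hyperplane
b = clf.intercept_[0]     # the offset term

scores = X @ w + b        # w . X_i + b for every sample
margins = Y * scores      # Y_i (w . X_i + b)

print(margins)            # >= 1 for all samples; approximately 1 for the support vectors
print(np.sign(scores))    # the predicted class of each sample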

Describe the MARGIN in a mathematical way.
For a pair of support vectors, $\vec{X}_+ - \vec{X}_-$ projected onto $\vec{w}$ is the width of the margin. (If you understand the geometric meaning of the vector dot product, this is easy to see.)

So, $(\vec{X}_+ - \vec{X}_-) \cdot \frac{\vec{w}}{\|w\|}$ is the width of the margin.
Since $Y_i(\vec{w} \cdot \vec{X}_i + b) = 1$ holds for the support vectors,
$\vec{X}_+ \cdot \vec{w} = 1 - b$ and $\vec{X}_- \cdot \vec{w} = -1 - b$,
so we can rewrite $(\vec{X}_+ - \vec{X}_-) \cdot \frac{\vec{w}}{\|w\|}$ as $\frac{2}{\|w\|}$.
And our goal is to MAXIMIZE $\frac{2}{\|w\|}$.
Again, for mathematical convenience, we can instead MINIMIZE $\frac{1}{2}\|w\|^2$.
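With a fitted $\vec{w}$ (for example the hypothetical clf.coef_[0] above), this margin width can be checked directly:

import numpy as np

w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)   # the 2 / ||w|| derived above
print(margin_width)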

Optimize our goal function.
$\frac{1}{2}\|w\|^2$ is our target function, but it comes with the constraint $Y_i(\vec{w} \cdot \vec{X}_i + b) \ge 1$.
The method of LAGRANGE MULTIPLIERS is a tool to optimize a target function with constraints.
3blue1brown gives a wonderful insight into this tool; I'll paste the link here if you are interested: https://www.youtube.com/watch?v=hQ4UNu1P2kw
Now, let's pack them together.
$L = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i [Y_i(\vec{w} \cdot \vec{X}_i + b) - 1]$
Setting $\frac{\partial L}{\partial w} = 0$, we get $\vec{w} = \sum_i \alpha_i Y_i \vec{X}_i$.
Setting $\frac{\partial L}{\partial b} = 0$, we get $\sum_i \alpha_i Y_i = 0$.
Plugging these 2 equations back into $L$:
$L = \frac{1}{2}(\sum_i \alpha_i Y_i \vec{X}_i) \cdot (\sum_j \alpha_j Y_j \vec{X}_j) - \sum_i \alpha_i [Y_i(\sum_j \alpha_j Y_j \vec{X}_i \cdot \vec{X}_j + b) - 1] = -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j Y_i Y_j \vec{X}_i \cdot \vec{X}_j + \sum_i \alpha_i$.
From this we can see that the classifier depends only on the dot products of the samples.
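This "dot products only" property can be seen in scikit-learn as well: a fitted SVC stores $\alpha_i Y_i$ (as dual_coef_) together with the support vectors, and the decision function can be rebuilt from dot products alone. A small sketch, reusing the hypothetical clf fitted earlier:

import numpy as np

sv = clf.support_vectors_        # the X_i whose alpha_i is non-zero
alpha_y = clf.dual_coef_[0]      # alpha_i * Y_i for those support vectors
b = clf.intercept_[0]

x_new = np.array([4.0, 4.0])     # a made-up new point to classify
score = np.sum(alpha_y * (sv @ x_new)) + b    # sum_i alpha_i Y_i (X_i . x) + b
print(score, clf.decision_function([x_new]))  # the two values should match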

Algorithm for a computer to find the solution of the $L$ equation above.
SMO (Sequential Minimal Optimization) is one of the most popular algorithms for this. This article only focuses on the math part, so I will not talk about it here.

Soft margin SVC
When you use the Python LinearSVC class in scikit-learn, the main parameter you control is C, and it comes from the "Soft margin SVC". (The $\alpha_i$ we talked about before are not parameters you set by hand; they are the Lagrange multipliers the solver finds, and they are non-zero only for the support vectors.)
Since most datasets are not ideal enough to be separated perfectly by a straight line, we can allow a few samples to cross the margin, which we call a soft margin. Like this,

Then we need to rewrite our constraint as
$Y_i(\vec{w} \cdot \vec{X}_i + b) \ge 1 - \xi_i$, with slack variables $\xi_i \ge 0$, and the target function becomes $\frac{1}{2}\|w\|^2 + C\sum_i \xi_i$.
The smaller C is, the more margin violations (mislabeled or margin-crossing samples) you allow. That makes your model more robust, but it can lower your training accuracy; the larger C is, the harder the classifier tries to classify every training sample correctly.
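A small sketch of how C behaves in scikit-learn's LinearSVC, on a made-up overlapping dataset (the exact numbers will vary, but the trend should hold):

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
# Two slightly overlapping Gaussian blobs (made-up data).
X_soft = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [2.0, 2.0]])
Y_soft = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 100.0):
    clf_soft = LinearSVC(C=C, max_iter=100000).fit(X_soft, Y_soft)
    w = clf_soft.coef_[0]
    print(C,
          2.0 / np.linalg.norm(w),          # margin width: typically shrinks as C grows
          clf_soft.score(X_soft, Y_soft))   # training accuracy: typically rises as C grows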

About the Kernels
Another important part of SVC is the kernel. It allows the user to project the P-dimensional dataset into another space for a better separation.
A kernel is a transform function: it takes in the P-dimensional dataset and spits out the transformed new dataset. Since we saw that the classifier depends only on the dot products of the samples, in practice the kernel only needs to supply the dot products computed in that new space.
One simple example: a 2-D dataset that is not linearly separable can become separable after transforming it into 3-D.
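A minimal sketch of that 2-D to 3-D example, using an explicit hand-made transform (adding $x_1^2 + x_2^2$ as a third feature) rather than a true kernel function:

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

# A 2-D dataset (two concentric circles) that no straight line can separate.
X2, y2 = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Add a third feature x1^2 + x2^2, so the inner circle and the outer ring
# sit at different "heights" in 3-D and a plane can separate them.
X3 = np.column_stack([X2, (X2 ** 2).sum(axis=1)])

print(LinearSVC(max_iter=100000).fit(X2, y2).score(X2, y2))   # poor in 2-D
print(LinearSVC(max_iter=100000).fit(X3, y2).score(X3, y2))   # (near-)perfect in 3-D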

About the Normalization 

Another thing to keep in mind: before you feed data into the SVC model, please make sure you have scaled or normalized it, because this affects the result a lot.
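One common way to do this in scikit-learn is to put a scaler in front of the classifier in a pipeline; a small sketch, reusing the made-up X and Y from the notation example:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Scale each feature to zero mean and unit variance before fitting the SVC.
model = make_pipeline(StandardScaler(), LinearSVC(max_iter=100000))
model.fit(X, Y)
print(model.predict(X))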
