Why use SVM?

by Greg Lamp

Support Vector Machine has become an extremely popular algorithm. In this post I try to give a simple explanation for how it works and give a few examples using the the Python Scikits libraries. All code is available on Github. I'll have another post on the details of using Scikits and Sklearn.

What is SVM?

SVM is a supervised machine learning algorithm which can be used for classification or regression problems. It uses a technique called the kernel trick to transform your data and then based on these transformations it finds an optimal boundary between the possible outputs. Simply put, it does some extremely complex data transformations, then figures out how to seperate your data based on the labels or outputs you've defined.

So what makes it so great?

Well SVM it capable of doing both classification and regression. In this post I'll focus on using SVM for classification. In particular I'll be focusing on non-linear SVM, or SVM using a non-linear kernel. Non-linear SVM means that the boundary that the algorithm calculates doesn't have to be a straight line. The benefit is that you can capture much more complex relationships between your datapoints without having to perform difficult transformations on your own. The downside is that the training time is much longer as it's much more computationally intensive.

Cows and Wolves

So what is the kernel trick?

The kernel trick takes the data you give it and transforms it. In goes some great features which you think are going to make a great classifier, and out comes some data that you don't recognize anymore. It is sort of like unraveling a strand of DNA. You start with this harmelss looking vector of data and after putting it through the kernel trick, it's unraveled and compounded itself until it's now a much larger set of data that can't be understood by looking at a spreadsheet. But here lies the magic, in expanding the dataset there are now more obvious boundaries between your classes and the SVM algorithm is able to compute a much more optimal hyperplane.

For a second, pretend you're a farmer and you have a problem--you need to setup a fence to protect your cows from packs of wovles. But where do you build your fence? Well if you're a really data driven farmer one way you could do it would be to build a classifier based on the position of the cows and wolves in your pasture. Racehorsing a few different types of classifiers, we see that SVM does a great job at seperating your cows from the packs of wolves. I thought these plots also do a nice job of illustrating the benefits of using a non-linear classifiers. You can see the the logistic and decision tree models both only make use of straight lines.

Code for generating the plots found here: gist

Let SVM do the hard work

In the event that the relationship between a dependent variable and independent variable is non-linear, it's not going to be nearly as accurate as SVM. Taking transformations between variables (log(x), (x^2)) becomes much less important since it's going to be accounted for in the algorithm. If you're still having troubles picturing this, see if you can follow along with this example.

Let's say we have a dataset that consists of green and red points. When plotted with their coordinates, the points make the shape of a red circle with a green outline (and look an awful lot like Bangladesh's flag).

What would happen if somehow we lost 1/3 of our data. What if we couldn't recover it and we wanted to find a way to approximate what that missing 1/3 looked like.

So how do we figure out what the missing 1/3 looks like? One approach might be to build a model using the 80% of the data we do have as a training set. But what type of model do we use? Let's try out the following:
- Logistic model
- Decision Tree
- SVM

I trained each model and then used each to make predictions on the missing 1/3 of our data. Let's take a look at what our predicted shapes look like...

Code for generating the plots found here: gist

Results

From the plots, it's pretty clear that SVM is the winner. But why? Well if you look at the predicted shapes of the decision tree and GLM models, what do you notice? Straight boundaries. Our input model did not include any transformations to account for the non-linear relationship between x, y, and the color. Given a specific set of transformations we definitely could have made GLM and the DT perform better, but why waste time? With no complex transformations or scaling, SVM only misclassified 117/5000 points (98% accuracy as opposed to DT-51% and GLM-12%! Of those all misclassified points were red--hence the slight bulge.

When not to use it

So why not use SVM for everything? Well unfortunately the magic of SVM is also the biggest drawback. The complex data transformations and resulting boundary plane are very difficult to interpret. This is why it's often called a black box. GLM and decision trees on the contrary are exactly the opposite. It's very easy to understand exactly what and why DT and GLM are doing at the expense of performance.