Support Vector Machines

A few months ago, whenever I heard the term Support Vector Machine (SVM) I would imagine something that looks like this:

Actually, this happened whenever I tried to make sense of the mathematics behind it. I admit it, the maths behind SVMs is pretty brutal.

It feels like there are very few ML resources out there that describe SVMs in layman’s terms. This post is my attempt to put together a simple illustration on how SVMs work.

First things first…What is SVM?

An SVM is a supervised learning model used for classification and regression analysis.

The simplest form of SVM is one that takes a set of linearly separable input data and predicts, for each given input, which of two possible classes it belongs to. In other words: a non-probabilistic binary linear classifier.

Consider the simple binary example below, where we’re trying to find a hyperplane that achieves the best separation between two classes of points.


  • H1 would be a pretty terrible hyperplane, as it groups some of the black points together with the white ones.
  • H2 successfully separates the two datasets, but there is a very small margin between the black points and the separator.
  • H3 can be seen as the optimal separator, since maximum margin was used.

In essence, SVM seeks the optimal H3:

SVM will find you that red line (H3) you see in the figure, with the angle and position that maximize the separation between the two classes of data. The data points closest to the hyperplane on either side, the ones that “support” it, are called the support vectors, hence the name SVM. Once your SVM has settled on the best position and orientation for the line, you can use it to classify data points it hasn’t seen yet.
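To make the "maximum margin" idea concrete, here is a minimal sketch in plain Python (no Orange needed). It does not train an SVM. It only compares three hand-picked lines on a made-up set of points and picks the one with the widest margin. The points and the three lines are hypothetical stand-ins for the black/white points and H1, H2, H3 in the figure.

```python
import math

# Hypothetical 2-D toy data: two classes that are linearly separable.
black = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0)]
white = [(4.0, 4.0), (5.0, 4.5), (4.5, 5.0)]

def signed_distance(point, w, b):
    """Signed distance from a point to the line w . x + b = 0."""
    return (w[0] * point[0] + w[1] * point[1] + b) / math.hypot(w[0], w[1])

def margin(w, b):
    """Distance from the line to the closest point, or None if the line
    does not put the two classes on opposite sides."""
    d_black = [signed_distance(p, w, b) for p in black]
    d_white = [signed_distance(p, w, b) for p in white]
    if max(d_black) < 0 < min(d_white):
        return min(min(d_white), -max(d_black))
    return None

# Three hand-picked candidate lines, each given as (w, b).
candidates = {
    "H1": ((1.0, 0.0), -1.8),  # x = 1.8: cuts through the black points
    "H2": ((1.0, 0.0), -2.2),  # x = 2.2: separates, but hugs the black points
    "H3": ((1.0, 1.0), -6.0),  # x + y = 6: roughly midway between the classes
}

margins = {name: margin(w, b) for name, (w, b) in candidates.items()}
separating = {name: m for name, m in margins.items() if m is not None}
best = max(separating, key=separating.get)

print(margins)
print("widest margin:", best)
```

A real SVM searches over all possible lines instead of three candidates, but the quantity it maximizes is the same margin computed here.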

But… Is it always that straightforward?

Nope, it’s not. In realistic scenarios, data is rarely linearly separable, and we often need more than 2 dimensions of features to represent it.

Here comes the ‘cool’ idea behind SVMs:

If your data is not linearly separable, why not ‘trick’ your model into thinking it is? I found the following video to be a great way to illustrate what this means:

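Essentially, the idea behind SVMs is to map your data points into another, higher-dimensional space via a nonlinear map, where a separating hyperplane is easier to find. Normally, working with data in a higher dimension means increasing computational expense. However, SVMs make use of a so-called kernel function, which gives the dot product between two mapped points directly from the original points, so the mapping itself never has to be computed explicitly.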

Besides linear SVMs, the most common kernel functions (tricks) are polynomial, radial basis function (RBF) and sigmoid.
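Here is a small sketch of why a kernel is a "trick", using the degree-2 polynomial kernel as the example. The feature map `phi` and the two sample points are my own illustrative choices. The point is only that the kernel value computed in 2 dimensions equals the dot product computed after mapping both points into 3 dimensions.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for a 2-D point (maps into 3-D)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: (x . y)^2."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 0.5)

explicit = dot(phi(x), phi(y))   # map first, then take the dot product
via_kernel = poly_kernel(x, y)   # never leaves the original 2-D space

print(explicit, via_kernel)
```

Both numbers agree, but the kernel version never builds the higher-dimensional vectors. That saving is what makes kernels such as RBF practical, where the mapped space is far too large to write out.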

Let’s see how we can implement a simple SVM classifier in Orange for our zoo scenario again:

Note that SVMLearner is what can be used to construct an SVM in Orange. SVMLearner supports several built-in kernel types and even user-defined kernels written in Python. The kernel type is denoted by the constants Linear, Polynomial, RBF, Sigmoid and Custom defined in Orange.classification.svm.kernels.

This post doesn’t dive deep into the specifics of kernel functions available, so we will just implement a simple default SVM.

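The following code constructs an SVM classifier. It first reports the classification accuracy (CA) and AUC of a default SVM learner using cross validation, then trains on data elements 1 to 20 and tests on the ones from 21 to 40.

import Orange
from Orange.classification import svm
from Orange.evaluation import testing, scoring

# Load the zoo dataset that ships with Orange (file name assumed)
data = Orange.data.Table("zoo")

# Cross-validate a default SVM learner to get the scores printed below
# (5 folds is an assumed setting)
learner = svm.SVMLearner()
results = testing.cross_validation([learner], data, folds=5)
print "CA: %.2f" % scoring.CA(results)[0]
print "AUC: %.2f" % scoring.AUC(results)[0]

# Train on one slice and classify another (slice end indices are exclusive)
trainingData = data[1:20]
classifier = learner(trainingData)
for d in data[21:40]:
    c = classifier(d)
    print "%10s; originally %s" % (c, d.getclass())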


Evaluation of Classifier’s Performance II: ROC Curves

The Receiver Operating Characteristic (ROC) curve is a technique that is widely used in machine learning experiments. An ROC curve is a graphical plot that summarises how a classification system performs and allows us to compare the performance of different classifiers.

To demonstrate the concept behind ROC curves, let’s consider the zoo scenario again. This time the zoo manager has a different problem: he wants to find the best peanuts for his animals. 



Suppose the zoo manager was able to tell when the animals enjoy the peanuts the most, and he has a sample of particular peanut brands, for which he has carefully assigned the correct classification. PS. Class label ‘1’ denotes ‘Awesome Peanuts’ while ‘0’ means ‘Meh Peanuts’:

Brand  Class
A       0
B       1
C       0
D       1
E       1
F       0
G       0

Now let’s say we also have a machine learning classifier that was trained using ‘peanutty’ features. (I’m not sure what these features would be, but I’m guessing the list includes: whether they are roasted, country of origin, whether they are salted, etc). Our classifier analyses each brand’s features and assigns a ‘happiness score’, which would resemble the classifier’s prediction on whether the animal will enjoy the peanuts.

Our classifier spits out the following ‘happiness scores’:

Brand  Class  Assigned Score
A       0          9
B       1          6
C       0          5
D       1          4
E       1          2
F       0         -1
G       0         -2

Now let’s order the peanut brands according to their assigned scores as follows:

            Highest rated brand                                                  Lowest rated brand

0     1      0      1        1        0         0


Looking at this order of scores shows us that we have used a not-so-good classifier. A better one would order the instances so that we have all the ones on the left and all the zeros on the right. Anyway, now we have enough information to finally plot our ROC curve.

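The ROC space has 2 axes, each having a maximum value of 1. The lower left corner is the origin. The x-axis tracks the negative labels (the false positive rate: the fraction of 0s passed so far) and the y-axis tracks the positive labels (the true positive rate: the fraction of 1s passed so far).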


ROC space

Now it’s time to use our sorted scoring vector. Draw a path on the ROC graph starting from the origin (the bottom left corner). Every time there’s a 1 in your vector, the path will step up. Every time there is a 0, the path will step right. Since our sorted scoring vector is  [0  1  0  1  1  0  0]   our ROC curve will look as follows:

ROC Curve


And now we have a summary of our classifier’s performance plotted on ROC curve!
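The path-drawing procedure described above can be sketched in a few lines of plain Python. The function name `roc_points` is mine. The one detail I'm adding is the step size: each step up is 1/(number of 1s) and each step right is 1/(number of 0s), so that the path ends exactly at the top right corner (1, 1).

```python
def roc_points(sorted_labels):
    """Turn labels sorted from highest to lowest score into ROC points."""
    positives = sum(sorted_labels)
    negatives = len(sorted_labels) - positives
    true_pos, false_pos = 0, 0
    points = [(0.0, 0.0)]  # start at the origin
    for label in sorted_labels:
        if label == 1:
            true_pos += 1   # a 1: step up
        else:
            false_pos += 1  # a 0: step right
        points.append((false_pos / negatives, true_pos / positives))
    return points

# The sorted scoring vector from the peanut example.
labels = [0, 1, 0, 1, 1, 0, 0]
points = roc_points(labels)

for x, y in points:
    print("x = %.2f, y = %.2f" % (x, y))
```

Plotting these points and joining them up gives the staircase shape of the ROC curve.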

Now imagine a line drawn from the bottom left corner to the top right corner. That diagonal line would divide the space into good and bad results. Essentially, that diagonal would mimic a random classifier and any points lying above it have a better performance than a random classifier. A perfect classifier would have results at the top left corner of the ROC space.

Can we calculate anything from an ROC curve?

We can compute the area under curve (AUC) from ROC plots. In our graph, that is essentially the area below the red line. A perfect classifier will have an AUC of 1. AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming ‘positive’ ranks higher than ‘negative’).
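The "probability of ranking a positive above a negative" reading of AUC can be checked directly on the peanut scores. This is a rough sketch. The function name `auc_by_pairs` is mine, and counting a tie as half a win is a common convention that I'm assuming here (there are no ties in this data anyway).

```python
# (class label, assigned score) for brands A to G, from the table above.
peanuts = [(0, 9), (1, 6), (0, 5), (1, 4), (1, 2), (0, -1), (0, -2)]

def auc_by_pairs(labelled_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [score for label, score in labelled_scores if label == 1]
    neg = [score for label, score in labelled_scores if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed convention: a tie counts as half
    return wins / (len(pos) * len(neg))

auc = auc_by_pairs(peanuts)
print("AUC: %.3f" % auc)
```

There are 3 x 4 = 12 positive/negative pairs and the positive brand wins 7 of them, so the AUC is 7/12, which is the same number you get by adding up the area under the staircase. It is only a little better than the 0.5 of a random classifier, which matches the "not-so-good classifier" verdict.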

PS. ROC curves can be generated using Orange Canvas as shown here:

Evaluation of Classifier’s Performance

In the previous posts we have discussed how we can use Orange to design a simple Bayesian classifier and assess its performance in Python. This post is focused on an important aspect that needs to be considered when using Machine Learning algorithms: how do you evaluate the performance of the classifier that you designed? After all, we need to have certain methodologies that we can apply in order to be able to compare different classification algorithms and understand how well each of them performs with our data.

There are two main areas that need to be discussed: Testing and Scoring.

1. Testing your models:

The aim of this step is to introduce your model to unseen data after it has been trained with a training dataset, in order to test how well it will do once it is used within your application. A simple approach is to decide on certain proportions for your two sets, for instance 70:30. In this case, 70% of your data will be selected for training and the other 30% will be used to test the model. The Python code below shows how, in our Zoo classifier problem, we can create a proportion test object called ‘res’ that uses 70% of the data as a training set for a Bayesian algorithm. The test is then repeated 50 times to ensure realistic results (the default value is 10). Note that the four modules that we imported here are all available upon downloading Orange.

import Orange, orange, orngTest, orngStat
data = Orange.data.Table("zoo")  # zoo dataset bundled with Orange (file name assumed)
bayes = Orange.classification.bayes.NaiveLearner()
res = orngTest.proportionTest([bayes], data, 0.7, 50)
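If you are curious what a 70:30 split looks like without any library, here is a bare-bones sketch. It is not how Orange implements proportionTest. The function name, the fixed seed and the list of 100 dummy records are all assumptions made for illustration.

```python
import random

def proportion_split(records, train_fraction=0.7, seed=0):
    """Shuffle the records and cut them into a training and a testing part."""
    shuffled = list(records)               # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))  # 100 dummy records standing in for real data
train, test = proportion_split(records)

print(len(train), "training records,", len(test), "testing records")
```

Repeating the test 50 times, as in the Orange call above, amounts to calling this with 50 different seeds and averaging the scores.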

A second way to test your models is to use the cross validation technique. This technique is based on the idea of dividing your data into k equally sized folds. So let’s say we choose to divide our data into 3 folds: A, B and C. Initially, folds A and B will be used for training, while C will be used for testing. We then use folds A and C for training and B for testing, and finally B and C for training and A for testing. The following diagram that I obtained from here demonstrates this in a clear way.


In order to use 3-fold cross validation to test your model, replace the previous proportion test line with the following:

res = orngTest.crossValidation([bayes], data, 3)
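To see the fold rotation in action, here is a small library-free sketch that works on record indices only. The function name `k_fold_indices` is mine, and dealing records into folds round-robin (without shuffling) is a simplification I chose. Orange's own implementation may pick folds differently.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross validation on n records."""
    # Deal the indices into k folds like cards: 0, k, 2k, ... go to fold 0, etc.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j in range(k) if j != i for idx in folds[j]]
        yield train, test

splits = list(k_fold_indices(9, 3))

for train, test in splits:
    print("train on", train, "test on", test)
```

Every record ends up in the test set exactly once, which is the property that makes cross validation a fair use of a small dataset.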

A specific case of cross validation is the leave-one-out approach (LOOCV). This approach makes use of the number of instances in our dataset as the value of k. A single observation from the dataset is used for validation, and the remaining observations as the training data. This means that if we have 100 records, we’ll need to divide them into 100 folds, use 99 for training and 1 for testing. The process is then repeated 100 times. Using LOOCV, we usually obtain almost unbiased accuracy estimates. However, the variance of these estimates tends to be high, which can make them unreliable. To use LOOCV in Python using Orange’s libraries, replace the test line with the following:

res = orngTest.leaveOneOut([bayes], data)


2. Scoring

Now that you have applied a validation technique on your data, it is necessary to have a quantitative way of evaluating your classification model, by measuring whether the model assigns the correct class value to the test instances. But before we discuss these scoring measures, it is necessary to understand the concept of a confusion matrix.

Consider a classification problem where you only have two classes: positive and negative. Each instance in your data is mapped to either a positive or a negative label. Given a classifier and an instance, there are four possible outcomes:

  • True Positive (TP):    If the instance is positive and it is classified as positive
  • False Negative (FN): If the instance is positive but it is classified as negative
  • True Negative (TN):  If the instance is negative and it is classified as negative
  • False Positive (FP):   If the instance is negative but it is classified as positive

(Yes, sounds a little confusing.. just read it again and it will start making more sense…)
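If reading it again didn't help, counting the four outcomes in code might. This is a minimal sketch. The two lists of labels are made up, with 1 meaning positive and 0 meaning negative.

```python
def confusion_counts(actual, predicted):
    """Count the four possible outcomes for a two-class problem."""
    counts = {"TP": 0, "FN": 0, "TN": 0, "FP": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1  # positive, classified as positive
        elif a == 1 and p == 0:
            counts["FN"] += 1  # positive, classified as negative
        elif a == 0 and p == 0:
            counts["TN"] += 1  # negative, classified as negative
        else:
            counts["FP"] += 1  # negative, classified as positive
    return counts

# Hypothetical true labels and classifier predictions for 8 instances.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(actual, predicted)
print(counts)
```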

The outcomes of the classification test can then be summarised in a confusion matrix, as shown below:


OK, now we have picked a validation technique, tested our model, worked out the confusion matrix…but HOW do we assess our model’s performance?

Well, there are a number of ways (usually referred to as scores) to evaluate how well your model assigns the correct class value to the test instances. Many of these scores can be calculated by using the values held in the confusion matrix.

Accuracy : This is the simplest scoring measure. It calculates the proportion of correctly classified instances.

Accuracy = (TP + TN) / (TP+TN+FP+FN)

Sensitivity (also called Recall or True Positive Rate): Sensitivity is the proportion of actual positives which are correctly identified as positives by the classifier.

Sensitivity = TP / (TP +FN)

Specificity (also called True Negative Rate): Specificity relates to the classifier’s ability to identify negative results. Consider the example of a medical test used to identify a certain disease. The specificity of the test is the proportion of patients that do not have the disease and successfully test negative for it. In other words:

Specificity = TN / (TN+FP)

Precision: This is the proportion of instances classified as positive that really are positive. In other words:

Precision = TP / (TP+FP)
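The four formulas above translate directly into plain Python. The confusion matrix counts below are invented for illustration (100 test instances in total) and are not results from the zoo data.

```python
# Hypothetical confusion matrix counts for 100 test instances.
TP, FN, TN, FP = 40, 10, 45, 5

accuracy = (TP + TN) / float(TP + TN + FP + FN)
sensitivity = TP / float(TP + FN)   # recall, true positive rate
specificity = TN / float(TN + FP)   # true negative rate
precision = TP / float(TP + FP)

print("Accuracy:    %.2f" % accuracy)
print("Sensitivity: %.2f" % sensitivity)
print("Specificity: %.2f" % specificity)
print("Precision:   %.2f" % precision)
```

Notice how the four scores differ even though they come from the same four counts, which is why a single accuracy figure rarely tells the whole story.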

The scoring measures discussed above can be used in Python using Orange’s evaluation functions as shown below:

print "Accuracy: %.2f" % Orange.evaluation.scoring.CA(res)[0]

print "Sens: %.2f" % Orange.evaluation.scoring.Sensitivity(res)[0]

print "Spec: %.2f" % Orange.evaluation.scoring.Specificity(res)[0]

print "Precision: %.2f" % Orange.evaluation.scoring.Precision(res)[0]

This concludes this post and the discussion about evaluation measures. There are additional concepts that are quite important in the area of assessing classification performance, but the techniques discussed in this post form the necessary foundation behind them.

Bayesian Classifier

In the previous post we saw how we can use Orange to write a simple Naive Bayes classifier in Python. This post is devoted to elaborating on the principles based on which Naive Bayes works. To start with, Naive Bayes is a Probabilistic Model. Probabilistic models are used in classification scenarios where we cannot compute the outcome of an event with complete certainty. These models are based on the idea of computing probabilities of the input being a member of all the possible categories. The category with the highest probability is often selected as the outcome of the classification task.

But…how does it work?

The Naive Bayes classifier is based on a statistical concept called Bayes’ rule (also known as Bayes’ theorem). To demonstrate, consider two events A: it is going to rain tomorrow and B: it will not rain tomorrow. Now, if you are asked about the probability that it will rain tomorrow, it is intuitive to think that since it’s either going to rain or not, there is a 50% chance for each event…

Actually, things work differently in Bayesian lands…

Bayes’ theorem states that the probability of an event taking place changes if there is information available about a related event. This means that if you recall the weather conditions for the last week, and you remember that it has actually rained every single day, your answer will no longer be 50%. In other words, the Bayesian approach provides a way of explaining how you should change your existing beliefs in the light of new evidence. It allows scientists to combine new data with their existing knowledge or expertise.


A Naive Bayes classifier would suggest that it is most likely going to rain tomorrow

The emphasis that Bayes’ rule places on prior probability makes it well suited to a wide range of scenarios. Consider the following comic that I obtained from xkcd:

Two statisticians, a frequentist and a bayesian, discovered a machine that supposedly measures whether the Sun has gone nova. The machine also rolls two dice, and if they both come up 6, the machine lies to us. Otherwise, the machine tells the truth.

The machine claims that the Sun has gone nova.

The frequentist statistician works out the probability of the machine lying to them. It comes out as 1/(6*6) ≈ 0.028. Since this seems like a very small chance (less than 0.05), he believes the machine.

The bayesian statistician simply applies the concept of prior probability and disagrees with the outcome of the machine.
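Here is a rough sketch of the bayesian statistician's reasoning as a calculation. The dice probabilities come from the comic. The prior probability that the Sun has gone nova is entirely my assumption (one in a million, chosen only to be "very small"), and the function name is made up.

```python
def posterior_nova(prior_nova):
    """P(Sun has gone nova | machine says yes), using Bayes' theorem."""
    p_yes_given_nova = 35.0 / 36.0    # machine tells the truth
    p_yes_given_no_nova = 1.0 / 36.0  # machine lies (double six)
    evidence = (p_yes_given_nova * prior_nova
                + p_yes_given_no_nova * (1.0 - prior_nova))
    return p_yes_given_nova * prior_nova / evidence

# Assumed prior: a one-in-a-million chance that the Sun has just gone nova.
posterior = posterior_nova(1e-6)
print("P(nova | machine says yes) = %.6f" % posterior)
```

Even after the machine says "yes", the probability of a nova stays tiny, because a lying machine is far more likely than an exploding Sun. That is the prior doing its job.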


Sounds good..but what are possible disadvantages of using the Naive Bayes classification algorithm?

Naive Bayes assumes that the input features are independent of each other. In other words, it assumes that the occurrence of a feature doesn’t affect the occurrence of any other feature, hence the prefix Naive. In realistic scenarios, you are very likely to be dealing with features that depend on each other. However, it is still a highly effective technique and is widely used in many classification scenarios.
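To show where the independence assumption enters, here is a tiny from-scratch Naive Bayes for boolean features. It is a sketch, not Orange's implementation. The six training records, the feature order (hair, feathers, lays eggs) and the add-one (Laplace) smoothing are all my own illustrative choices.

```python
from collections import Counter

def train(records):
    """Count classes and, per class, how often each feature equals 1."""
    class_counts = Counter(label for _, label in records)
    n_features = len(records[0][0])
    ones = {label: [0] * n_features for label in class_counts}
    for features, label in records:
        for i, value in enumerate(features):
            ones[label][i] += value
    return class_counts, ones

def predict(model, features):
    """Return the class with the highest prior * product of feature likelihoods."""
    class_counts, ones = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / float(total)  # prior P(class)
        for i, value in enumerate(features):
            # Smoothed estimate of P(feature_i = 1 | class)
            p_one = (ones[label][i] + 1.0) / (count + 2.0)
            # The "naive" step: multiply per-feature terms as if independent
            score *= p_one if value == 1 else (1.0 - p_one)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical records: (hair, feathers, lays eggs) -> animal type.
records = [
    ((1, 0, 0), "mammal"), ((1, 0, 0), "mammal"), ((1, 0, 1), "mammal"),
    ((0, 1, 1), "bird"), ((0, 1, 1), "bird"), ((0, 0, 1), "fish"),
]
model = train(records)

print(predict(model, (1, 0, 0)))
print(predict(model, (0, 1, 1)))
```

The single line that multiplies the per-feature probabilities together is the whole "naive" assumption. A model that respected dependencies between features would need a joint probability there instead.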

Build a Zoo Classifier using Naive Bayes

Say you write software at a lab somewhere. You decided to visit your local zoo over the weekend and you end up having  a nice chat with the zoo manager. The zoo manager knows you’re a computer guy, and he also knows a little bit about technology jargon. One day your local zoo manager walks up to you and gives you a box labelled “animals”.

He asks if you  can help him by writing a program to classify this data. He then walks away because he’s a busy man.


Your raw data

You have no idea what’s in the box. It looks too small to have actual rabbits in. You decide to open the box, and you find piles of paper that have nothing but numbers written on them. Rows and rows of numbers. You know that you can use machine learning to ‘classify’ data, but you also need to know what exactly these numbers mean. In order to use machine learning, you need to have a meaningful interpretation of your data. You run after the zoo manager and try to find out what these numbers mean. He tells you that the intern he hired over the summer spent a lot of time taking pictures of animals in the zoo. Now that makes more sense! Computers are good at interpreting numbers, so it seems like these papers hold pixel values of animal images.

The zoo manager gives you a USB stick with electronic copies of the piles of paper. Now that makes your life easier. But we mentioned in the previous post that we need to have features  in order to use ML algorithms. Your data is so far just a bunch of images that the intern spent his time taking. We still don’t have any attributes that can be used to analyse patterns of animals. Wait! You open the README.txt file and it indicates where you can find features that the intern extracted using these images! That makes your life A LOT easier. ‘Animaly’ features are probably things like whether the animal has hair, feathers, fins, lays eggs, etc. These are simple boolean features that are probably good enough to classify the limited number of animals in your local zoo.

Now your problem is simply writing some code that predicts an animal type using information that you currently hold about existing animals in the zoo (labelled data). You have features that describe your animals. Patterns exist within these features, as we discussed above. You don’t have a mathematical formula that can do that work for you. It’s a supervised ML problem, more specifically, a classification task. You still don’t know how complex your proposed system will be..after all, you’re teaching it how to think..

AI comic

Let’s see how we can use Orange to do that. Luckily, it comes with a file that has a list of records, each holding a bunch of animal features. Each record is labelled with the class of animal. Much of Orange is devoted to machine learning methods for classification, or supervised data mining. These methods rely on the data with class-labeled instances, which we have in the file.

Classification tasks use two types of objects: learners and classifiers. Learners analyse your labeled data and form a model that can classify future instances. Much like training a baby. They have to see a number of cats and dogs so that they build a mental model of their features, and hence be able to differentiate between them.

If you load up Orange Canvas, you’ll be able to script in python using the Python Script widget. It essentially gives you a little environment where you can write executable script as well as a scripting console. The widget looks like this:

Orange's Python Script Widget


Ok so how do we build a zoo classifier using our feature set?

data = Orange.data.Table("zoo")  # zoo dataset bundled with Orange (file name assumed)

bayes = Orange.classification.bayes.NaiveLearner()

res = Orange.evaluation.testing.cross_validation([bayes], data, folds=5)

print "Accuracy: %.2f" % Orange.evaluation.scoring.CA(res)[0]

The code is as straightforward as it looks. You create an object called ‘data’ that holds your table. You then create another object (let’s call it bayes) that acts as a Naive Bayes learner. But you also need to have an idea of how well Naive Bayes will perform on unseen datasets once it has learned from the ones that were given to you in the box. One common way is to split your data into a training and a testing set. However, the code above makes use of a technique called Cross Validation, which is commonly used if you don’t have a particular dataset that you want to use for testing. You then simply display your classifier’s accuracy that was obtained from the result of the cross validation process. That’s it!

Machine learning problems could get way more complicated than this, but you can see how with Orange, 4 lines of code are able to analyse our features, apply a learning algorithm, and test out how well it does!


One of the comments suggested that I show an example of how the classifier we defined can go through a test file and output classification results. Let’s do this!

We start by simply importing the zoo file again.

Since there is no specific test file, let’s split our zoo file as follows: Elements 1 to 20 will be used for training, and elements 21 to 40 will be used for testing. You should never use the same data for both training and testing! By splitting our file into separate parts, we now have a training set and a testing set. This is good.

We train our classifier on the training set. We then apply the trained classifier to our test set. In order to compare the prediction with the original class, we display both for each element in the test set.

import Orange

data = Orange.data.Table("zoo")  # zoo dataset bundled with Orange (file name assumed)

trainingData = data[1:20]

learner = Orange.classification.bayes.NaiveLearner()

classifier = learner(trainingData)

for d in data[21:40]:
    c = classifier(d)
    print "%10s; originally %s" % (c, d.getclass())

Your results will be something like this:
mammal; originally bird
mammal; originally bird
mammal; originally mammal
mammal; originally bird
mammal; originally insect
mammal; originally amphibian
mammal; originally amphibian
mammal; originally mammal
mammal; originally mammal
mammal; originally mammal
mammal; originally insect
mammal; originally mammal
mammal; originally mammal
mammal; originally bird
fish; originally fish
mammal; originally mammal
mammal; originally mammal
mammal; originally bird
fish; originally fish
mammal; originally insect


So when can we refer to our problem as a machine learning task?

I watched a bunch of online lectures and read several articles, until it became apparent that the best answer to this question was given by Professor Yaser Abu-Mostafa of Caltech. Professor Abu-Mostafa identifies three components that a problem must have for it to fall under the ML category:

1. A pattern exists. If there is no pattern within your data then there is nothing to look for. If you’re a banker and you’re looking for the best credit formula that determines whether to accept or reject a customer for a loan, your pattern probably involves the customer’s salary, credit history, number of years living in the country, etc. Similarly, if you’re building the next Netflix and are thinking of using ML tools, it is a good idea since you have patterns that are related to customers’ favourite movies.

2. Can’t pin it down mathematically. If we had a mathematical formula that captured how customers rate their interests on Netflix, we wouldn’t be looking for ML tools that help us understand their behaviour.

3. You have data. ML is all about data. If you don’t have data to analyse, there’s nothing the machine can learn.

(For an awesome 80 minute elaborate discussion on the above 3 points: Professor Abu-Mostafa’s ML Lecture 01. )


Features form the core of any ML problem, since ML is concerned with using the right features to build the right models that achieve the right task. In most cases, features are numerical measurements that describe properties of what is being observed. Suppose you are working on building a tool that detects email spam.  A feature could be the number of times the word  viagra appears in the received email. Another feature could be whether a subject line is present in the email. If you are working on an image analysis task then your features would probably be related to the shape and colour of objects in these images.
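As a quick illustration of the two spam features mentioned above, here is a minimal sketch that turns raw email text into numbers a model could use. The function name, the "Subject:" header convention and the sample email are all assumptions made for the example.

```python
def extract_features(email_text, keyword="viagra"):
    """Turn raw email text into two simple features."""
    lines = email_text.splitlines()
    # Feature 1: is there a non-empty "Subject:" line?
    has_subject = any(
        line.lower().startswith("subject:") and line[len("subject:"):].strip() != ""
        for line in lines
    )
    # Feature 2: how many times does the keyword appear (case-insensitive)?
    keyword_count = email_text.lower().count(keyword.lower())
    return {"keyword_count": keyword_count, "has_subject": has_subject}

# A made-up email used only for illustration.
email = "Subject: Cheap VIAGRA today\n\nBuy viagra now. Viagra is cheap."
features = extract_features(email)
print(features)
```

A real spam filter would use many more features than this, but each one is built the same way: a small function that measures one property of the thing being observed.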


While different scenarios obviously have different features, once the features are decided, it is necessary to choose a model that makes the most of them. The model could be geometric (such as linear classifiers or SVMs), probabilistic (such as the Bayesian approach) or logic-based (such as decision trees). We will explore various models and how to use them within Orange in later posts.

print(“Hello, World!”)

Right.. So for the last year, I’ve been experimenting with machine learning technologies to see if we can build computational models that help us better understand MRI images. A couple of days ago, I had a chat with a colleague who was looking for the best way to analyse his data that he uses for his research with the automotive industry. After spending some time with him, I realised how useful ML techniques could be, no matter what your areas of interest are. I knew that at the back of my mind, but when I actually saw a lot of the algorithms I use being applied in a completely different field, I thought why not talk about ML on a blog? Also, would be a cool way of keeping logbook entries.

Before I start posting anything, I’d like to acknowledge my favourite programming language:


Thanks to python, I looked around over a year ago for a python-based ML library (more about that in a minute). If you’ve never used python, but know any other programming language, I definitely suggest learning it asap. Not convinced? You’ll never have to worry about those frustrating semicolons again!

(Awesome Crash Course!)

So, now about that cool ML library. Orange. It’s intuitive and easy to use. You can rapidly design your models using its visual programming interface, or via python scripting. So if I didn’t do a good job convincing you to learn python, you can always take the lazy man approach to building your models (visual programming). I do that a lot. And once they are good enough, I can write a script that works well.

So to wrap up:

  • I like looking at MRI images and trying to see if a computer can do a better job
  • I like python
  • Orange is an awesome ML library that I’ll be using throughout this blog