Confusion Matrix in Machine Learning

7 min read · Jan 14, 2021

In the field of machine learning and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class (or vice versa). The name stems from the fact that it makes it easy to see whether the system is confusing two classes (i.e. commonly mislabeling one as another).

“A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making.”

Example

Given a sample of 13 pictures, 8 of cats and 5 of dogs, where cats belong to class 1 and dogs belong to class 0:

actual = [1,1,1,1,1,1,1,1,0,0,0,0,0]

Assume that a classifier that distinguishes between cats and dogs has been trained, and that we run the 13 pictures through it. The classifier makes 8 correct predictions and 5 wrong ones: 3 cats wrongly predicted as dogs (the first 3 predictions) and 2 dogs wrongly predicted as cats (the last 2 predictions).

prediction = [0,0,0,1,1,1,1,1,0,0,0,1,1]

With these two labelled sets (actual and prediction) we can create a confusion matrix that summarizes the results of testing the classifier:

                 Actual cat   Actual dog
Predicted cat        5            2
Predicted dog        3            3

In this confusion matrix, of the 8 cat pictures, the system judged that 3 were dogs, and of the 5 dog pictures, it predicted that 2 were cats. All correct predictions lie on the diagonal of the table (5 and 3), so it is easy to inspect the table for prediction errors: they are the values off the diagonal.
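The tally for this example can be reproduced in a few lines of plain Python. This is only a minimal sketch: the names pair_counts, classes and matrix are illustrative, and the rows-as-predicted, columns-as-actual layout is chosen to match the description at the top of this article.

```python
from collections import Counter

# Labels from the example: 1 = cat, 0 = dog.
actual     = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
prediction = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1]

# Count how often each (predicted, actual) pair occurs.
pair_counts = Counter(zip(prediction, actual))

# Rows are predicted classes, columns are actual classes, cat (1) first.
classes = [1, 0]
matrix = [[pair_counts[(pred, act)] for act in classes] for pred in classes]

names = {1: "cat", 0: "dog"}
for pred, row in zip(classes, matrix):
    print(f"predicted {names[pred]}: {row}")
# predicted cat: [5, 2]
# predicted dog: [3, 3]
```

The two diagonal entries (5 and 3) are the correct predictions, and the off-diagonal entries (2 and 3) are the errors.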

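In abstract terms, with rows as predicted classes and columns as actual classes, the confusion matrix is as follows:

                      Actual positive        Actual negative
Predicted positive    True Positive (TP)     False Positive (FP)
Predicted negative    False Negative (FN)    True Negative (TN)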

Understanding True Positive, True Negative, False Positive and False Negative in a Confusion Matrix

True Positive (TP)

The predicted value matches the actual value: the actual value was positive and the model predicted a positive value.

True Negative (TN)

The predicted value matches the actual value: the actual value was negative and the model predicted a negative value.

False Positive (FP) — Type 1 error

The predicted value does not match the actual value: the actual value was negative but the model predicted a positive value. This is also known as a Type 1 error.

False Negative (FN) — Type 2 error

The predicted value does not match the actual value: the actual value was positive but the model predicted a negative value. This is also known as a Type 2 error.

For the cat and dog example above, with cat (class 1) as the positive class, the values of the confusion matrix are as follows:

True Positive (TP) = 5; meaning 5 positive class data points were correctly classified by the model

True Negative (TN) = 3; meaning 3 negative class data points were correctly classified by the model

False Positive (FP) = 2; meaning 2 negative class data points were incorrectly classified as belonging to the positive class by the model

False Negative (FN) = 3; meaning 3 positive class data points were incorrectly classified as belonging to the negative class by the model

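From these values, the accuracy of the classifier is:

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (5 + 3) / (5 + 3 + 2 + 3) = 8/13 ≈ 0.62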
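The four counts and the accuracy can also be derived directly from the two label lists. The helper count_outcomes below is a hypothetical name introduced for illustration, and it assumes binary labels with cat (1) as the positive class.

```python
actual     = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
prediction = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1]


def count_outcomes(actual, prediction, positive=1):
    """Return (TP, TN, FP, FN), treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, prediction):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p != positive and a != positive:
            tn += 1   # predicted negative, actually negative
        elif p == positive:
            fp += 1   # predicted positive, actually negative (Type 1 error)
        else:
            fn += 1   # predicted negative, actually positive (Type 2 error)
    return tp, tn, fp, fn


tp, tn, fp, fn = count_outcomes(actual, prediction)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tp, tn, fp, fn)      # 5 3 2 3
print(round(accuracy, 3))  # 0.615
```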

Why Do We Need a Confusion Matrix?

Before we answer this question, let’s think about a hypothetical classification problem.

Let’s say you want to predict which people are infected with a contagious virus before they show symptoms, so that they can be isolated from the healthy population. The two values for our target variable would be: Sick and Not Sick.

Now, you must be wondering — why do we need a confusion matrix when we have our all-weather friend — Accuracy? Well, let’s see where accuracy falters.

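Our dataset is an example of an imbalanced dataset: of 1,000 people, only 40 are actually sick. The outcome counts are:

TP = 30, TN = 930, FP = 30, FN = 10

So, the accuracy for our model turns out to be:

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (30 + 930) / 1000 = 0.96

96%! Not bad!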

But it is giving the wrong idea about the result. Think about it.

Our model seems to be saying “I can predict sick people 96% of the time”. However, it is doing nearly the opposite. The score is dominated by the 960 healthy people, most of whom are correctly labelled as not sick, while 10 of the 40 sick people are missed and keep spreading the virus!
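One way to see how little the 96% says is to compare it with a trivial baseline. The sketch below uses the counts given above for our model, plus a hypothetical baseline (not part of the original example) that labels all 1,000 people as Not Sick.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Counts from the example: 40 sick and 960 healthy people.
model = dict(tp=30, tn=930, fp=30, fn=10)

# Hypothetical baseline that predicts Not Sick for everyone:
# all 960 healthy people are right, all 40 sick people are missed.
baseline = dict(tp=0, tn=960, fp=0, fn=40)

model_accuracy = accuracy(**model)
baseline_accuracy = accuracy(**baseline)

print(model_accuracy, "sick people found:", model["tp"])        # 0.96, 30
print(baseline_accuracy, "sick people found:", baseline["tp"])  # 0.96, 0
```

Both reach 96% accuracy, yet the baseline never finds a single sick person, so accuracy alone cannot tell the two apart.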

Do you think this is a correct metric for our model given the seriousness of the issue? Shouldn’t we be measuring how many positive cases we can predict correctly to arrest the spread of the contagious virus? Or maybe, out of the cases we predicted as positive, how many are actually positive, to check the reliability of our model?

This is where we come across the dual concept of Precision and Recall.

Precision vs. Recall

Precision tells us how many of the cases predicted as positive actually turned out to be positive:

Precision = TP / (TP + FP)

This would determine whether our model is reliable or not.

Recall tells us how many of the actual positive cases we were able to predict correctly with our model:

Recall = TP / (TP + FN)

We can easily calculate Precision and Recall for our model by plugging the values into the above formulas:

Precision = 30 / (30 + 30) = 0.5
Recall = 30 / (30 + 10) = 0.75

50% of the cases predicted as positive turned out to be actual positive cases, whereas 75% of the actual positives were successfully predicted by our model.
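As a quick check, here is the same calculation in Python, using the standard definitions Precision = TP / (TP + FP) and Recall = TP / (TP + FN). The function names are illustrative, and the sketch assumes the denominators are non-zero.

```python
def precision(tp, fp):
    """Of everything predicted positive, the share that is truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of everything truly positive, the share that was predicted positive."""
    return tp / (tp + fn)


# Counts from the virus example.
tp, tn, fp, fn = 30, 930, 30, 10

virus_precision = precision(tp, fp)
virus_recall = recall(tp, fn)

print(virus_precision)  # 0.5
print(virus_recall)     # 0.75
```

Note that TN does not appear in either formula, which is why neither metric is inflated by the large number of healthy people.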

Precision is a useful metric in cases where False Positives are a higher concern than False Negatives.

Precision is important in music or video recommendation systems, e-commerce websites, etc. Wrong results could lead to customer churn and be harmful to the business.

Recall is a useful metric in cases where False Negatives are a higher concern than False Positives.

Recall is important in medical cases where raising a false alarm is tolerable but the actual positive cases should not go undetected!

In our example, Recall would be a better metric because we don’t want to accidentally discharge an infected person and let them mix with the healthy population thereby spreading the contagious virus. Now you can understand why accuracy was a bad metric for our model.

But there will be cases where there is no clear answer as to whether Precision or Recall is more important. What should we do in those cases? We combine them!

F1-Score

In practice, when we try to increase the precision of our model, the recall often goes down, and vice versa. The F1-score captures both in a single value:

F1-score = 2 × (Precision × Recall) / (Precision + Recall)

F1-score is the harmonic mean of Precision and Recall, and so it gives a combined idea about these two metrics. For a given sum of Precision and Recall it is highest when the two are equal, and it reaches its maximum of 1 only when both are 1.

But there is a catch here. The interpretability of the F1-score is poor: a single F1 value does not tell us whether it is precision or recall that is holding the classifier back. So, we use it in combination with other evaluation metrics, which together give us a complete picture of the result.
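To make the harmonic mean concrete, the sketch below computes the F1-score for the virus example (Precision 0.5, Recall 0.75). It then compares that with two made-up pairs that share the same arithmetic mean of 0.625. The function name f1 and the extra pairs are illustrative assumptions.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Virus example from above.
virus_f1 = f1(0.5, 0.75)

# Two made-up pairs with the same arithmetic mean (0.625) as the virus example.
balanced_f1 = f1(0.625, 0.625)
lopsided_f1 = f1(0.25, 1.0)

print(round(virus_f1, 3))     # 0.6
print(round(balanced_f1, 3))  # 0.625
print(round(lopsided_f1, 3))  # 0.4
```

The more lopsided the pair, the further the F1-score drops below the simple average, but the score alone does not say which of the two metrics is the weak one.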

Confusion Matrix using scikit-learn in Python

scikit-learn (sklearn) has two useful functions for this in sklearn.metrics: confusion_matrix() and classification_report().

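Sklearn confusion_matrix() returns the values of the confusion matrix. The output is, however, laid out slightly differently from what we have studied so far: it takes the rows as Actual values and the columns as Predicted values, and it orders the classes by sorted label, so for 0/1 labels the negative class comes first and the matrix reads [[TN, FP], [FN, TP]]. The rest of the concept remains the same.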
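Here is a small sketch of confusion_matrix() applied to the cat and dog example from earlier, assuming scikit-learn is installed. With labels 0 (dog) and 1 (cat), rows are actual classes and columns are predicted classes, both in sorted label order.

```python
from sklearn.metrics import confusion_matrix

# Labels from the earlier example: 1 = cat, 0 = dog.
actual     = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
prediction = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1]

# First argument is the true labels, second is the predicted labels.
cm = confusion_matrix(actual, prediction)
print(cm)
# [[3 2]
#  [3 5]]

# For binary 0/1 labels, flattening the matrix gives TN, FP, FN, TP in that order.
tn, fp, fn, tp = (int(x) for x in cm.ravel())
print(tp, tn, fp, fn)  # 5 3 2 3
```

The counts match the values worked out by hand; only the arrangement of the cells differs.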

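Sklearn classification_report() outputs precision, recall and f1-score for each target class, along with the support (the number of actual instances of that class). In addition to this, it also has some summary rows: micro avg, macro avg, and weighted avg. In recent scikit-learn versions, for ordinary single-label problems the report shows an accuracy row in place of micro avg, because the two are equal in that case.

Micro average is the precision/recall/f1-score calculated globally, by pooling the true positives, false positives and false negatives of all the classes.

Macro average is the unweighted mean of the per-class precision/recall/f1-score, so every class counts equally regardless of its size.

Weighted average is the mean of the per-class precision/recall/f1-score, with each class weighted by its support.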
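The averages can be checked by hand against the report. The sketch below assumes scikit-learn is installed and reuses the cat and dog labels from earlier. It asks for the report as a dictionary (output_dict=True) and recomputes the macro, weighted and micro averages of precision. The target names "dog" and "cat" are supplied here only for readability.

```python
from sklearn.metrics import classification_report, confusion_matrix

actual     = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
prediction = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1]

# target_names follow the sorted label order: 0 = dog, 1 = cat.
report = classification_report(
    actual, prediction, target_names=["dog", "cat"], output_dict=True
)
per_class = [report["dog"], report["cat"]]

# Macro average: plain mean of the per-class values.
macro_precision = sum(c["precision"] for c in per_class) / len(per_class)

# Weighted average: per-class values weighted by support.
total_support = sum(c["support"] for c in per_class)
weighted_precision = (
    sum(c["precision"] * c["support"] for c in per_class) / total_support
)

# Micro average: pool the counts of all classes. Pooled TP is the diagonal
# sum, and pooled TP + FP is the total number of predictions.
cm = confusion_matrix(actual, prediction)
micro_precision = cm.trace() / cm.sum()

print(round(macro_precision, 3), round(report["macro avg"]["precision"], 3))
print(round(weighted_precision, 3), round(report["weighted avg"]["precision"], 3))
print(round(micro_precision, 3), round(report["accuracy"], 3))
```

Each line prints the hand-computed value next to the one from the report. For this single-label problem the micro average coincides with accuracy.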
