Failure mode analysis of machine learning models is a critical step in understanding the regime in which a model can be applied. I find Receiver Operating Characteristic (ROC) curves particularly useful in this process of selecting the most viable model to apply to a given problem.
ROC curves are most useful in a binary classification scenario (or a multi-label one, where each label is treated as its own binary problem). They help us understand how the entries of our confusion matrix change as we vary a decision threshold for our positive class.
If you're unfamiliar with confusion matrices, we'll first walk through an example and cover the metrics that an ROC curve conveys. Below is a sample confusion matrix for a hypothetical computer vision model trained to detect dogs in natural images.
Confusion Matrices
|                    | Actual: Dog | Actual: Not Dog |
|--------------------|-------------|-----------------|
| Predicted: Dog     | 5           | 2               |
| Predicted: Not Dog | 3           | 3               |
Although this first example has concrete numbers for each of the four entries in the table, we often refer to these four entries by name – True Positive, False Positive, False Negative, and True Negative.
|                    | Actual: Dog    | Actual: Not Dog |
|--------------------|----------------|-----------------|
| Predicted: Dog     | True Positive  | False Positive  |
| Predicted: Not Dog | False Negative | True Negative   |
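If it helps to see these counts in code, here's a minimal sketch that reproduces the toy matrix with scikit-learn's `confusion_matrix`; the label arrays are hypothetical, invented purely to match the numbers above.

```python
# A minimal sketch with hypothetical labels: 1 = "Dog", 0 = "Not Dog".
from sklearn.metrics import confusion_matrix

y_true = [1] * 8 + [0] * 5                        # 8 images contain a dog, 5 do not
y_pred = [1] * 5 + [0] * 3 + [1] * 2 + [0] * 3    # the model's predictions

# Note: scikit-learn puts actual classes on the rows and predictions on the
# columns, i.e. [[TN, FP], [FN, TP]], the transpose of the table above.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)  # 5 2 3 3
```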
Common Metrics for Binary Classification Problems
Many common metrics you may have used in the past (precision and recall, for instance) can be expressed as simple ratios of the terms in this binary confusion matrix.
Precision
Precision is the percentage of positive predictions that were actually positive. Expressed as a ratio of our confusion matrix entries, this is:
True Positives / (True Positives + False Positives)
For our toy dog classifier, our model correctly identified 5 images as containing dogs (True Positives), but also incorrectly predicted "dog" for 2 images without a dog (False Positives). This gives us a precision of 5/(5+2), or ~71%.
We can visualize this ratio by highlighting the relevant entries of our confusion matrix (shown in bold below).
|                    | Actual: Dog       | Actual: Not Dog    |
|--------------------|-------------------|--------------------|
| Predicted: Dog     | **True Positive** | **False Positive** |
| Predicted: Not Dog | False Negative    | True Negative      |
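As a quick check in code, here's the same ratio computed from the toy counts (the commented scikit-learn call assumes the hypothetical `y_true`/`y_pred` arrays from the earlier sketch):

```python
# Precision for the toy dog detector, straight from the confusion-matrix entries.
tp, fp = 5, 2
precision = tp / (tp + fp)
print(round(precision, 3))  # 0.714, i.e. ~71%

# scikit-learn equivalent, given the hypothetical label arrays from earlier:
# from sklearn.metrics import precision_score
# precision_score(y_true, y_pred)
```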
Recall
Similarly, Recall is the percentage of actual positives that our model correctly predicted. Recall is also known as the True Positive Rate (TPR). Expressed as a ratio of our confusion matrix entries, this is:
True Positives / (True Positives + False Negatives)
Of the images containing dogs, our model correctly predicted 5, but failed to catch 3 additional images that contained dogs. This equates to a recall of 5/(5+3), or 62.5%.
|                    | Actual: Dog        | Actual: Not Dog |
|--------------------|--------------------|-----------------|
| Predicted: Dog     | **True Positive**  | False Positive  |
| Predicted: Not Dog | **False Negative** | True Negative   |
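The corresponding sketch for recall, again using the toy counts:

```python
# Recall (True Positive Rate): of the images that actually contain dogs,
# how many did the model catch?
tp, fn = 5, 3
recall = tp / (tp + fn)
print(round(recall, 3))  # 0.625, i.e. 62.5%

# scikit-learn equivalent: sklearn.metrics.recall_score(y_true, y_pred)
```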
False Positive Rate (FPR)
ROC Curves make use of a third useful metric – known as the False Positive Rate (FPR) or False Alarm Rate. The False Positive Rate represents the percentage of negative examples that a classifier incorrectly predicts as positive.
As a ratio, this is:
False Positives / (False Positives + True Negatives)
Of the images that did not contain dogs, our model incorrectly classified 2 as containing a dog, while it correctly identified that an additional 3 did not, for a False Positive Rate of 2/(2+3), or 40%.
|                    | Actual: Dog    | Actual: Not Dog    |
|--------------------|----------------|--------------------|
| Predicted: Dog     | True Positive  | **False Positive** |
| Predicted: Not Dog | False Negative | **True Negative**  |
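And the same arithmetic for the False Positive Rate:

```python
# False Positive Rate: of the images with no dog, how many did the model
# incorrectly flag as containing one?
fp, tn = 2, 3
fpr = fp / (fp + tn)
print(round(fpr, 3))  # 0.4, i.e. 40%

# scikit-learn has no dedicated fpr_score; read FP and TN off confusion_matrix.
```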
ROC Curves
Now that you have intuition for Recall and the False Positive Rate, we have all the necessary building blocks to start looking at ROC Curves. ROC Curves plot how the True Positive Rate (Recall) varies with respect to the False Positive Rate.
Below is an animation that illustrates this relationship. We start with our decision threshold set to 1.0 and sweep it down to 0.0. The confidences of our positive class are indicated in blue, while the confidences of our negative class are displayed in green. Everything to the right of the decision threshold (indicated by the red line in the animation below) is considered a "positive prediction", and everything to the left of the threshold is correspondingly considered a "negative prediction". Points from both classes have been jittered in the top plot to make sure you can get a good sense of the two class distributions.
At the beginning of the animation, no examples exceed our threshold, so we have no True Positives and no False Positives, and plot our (False Positive Rate, True Positive Rate) at (0, 0). As we gradually move our decision threshold, we start to pick up both True Positives and False Positives. Every time we cross over a positive example, our line moves upward, indicating that our True Positive Rate has increased because we've detected a new example from our positive class. Every time we cross over a negative example, our line moves to the right, indicating that our False Positive Rate has increased.
Typical ROC Curve
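If you'd like to trace this sweep yourself, here's a minimal sketch on simulated, hypothetical confidence scores; scikit-learn's `roc_curve` performs the same threshold sweep and hands back the (FPR, TPR) pairs.

```python
# A minimal sketch of the threshold sweep on hypothetical, simulated scores.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
# Simulated model confidences: positives tend to score higher than negatives.
pos_scores = rng.normal(loc=0.7, scale=0.15, size=50)
neg_scores = rng.normal(loc=0.4, scale=0.15, size=50)

y_true = np.concatenate([np.ones(50), np.zeros(50)])
y_score = np.concatenate([pos_scores, neg_scores])

# roc_curve sweeps the decision threshold from high to low and records the
# False Positive Rate and True Positive Rate at each step.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Optional plot:
# import matplotlib.pyplot as plt
# plt.plot(fpr, tpr); plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate")
```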
In this sense, the ROC curve is a measure of the separability of the class distributions, which you can see quite directly in the upper plot! In a perfect world, our model is able to perfectly separate the positive examples from the negative examples. This would produce a (quite boring) ROC curve like we see below.
Perfect Model
For a particularly tough classification task, our model may not be able to find any means to separate the two distributions. In this case, we'd get an ROC curve that looks roughly like a straight line from (0, 0) to (1, 1). This tells us that the distributions of our two classes are entirely overlapping.
Poor Quality Model
Area Under the Receiver Operating Characteristic Curve (ROCAUC)
If you're trying to select the best model from a cross-validation sweep, you can use the ROC curve to derive another metric that's useful for model selection. ROCAUC measures the area under the receiver operating characteristic curve.
In the case of a perfect model, our ROCAUC is 1.0. A model that performs no better than random guessing scores a 0.5.
It has the nice property that high-confidence False Positive predictions impact our metric more than low-confidence False Positive predictions, and it gives us a single number that describes the separability of our positive and negative class distributions. Especially for problems that involve detecting rare behavior, ROCAUC may be a useful metric to monitor.
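As a sketch, `roc_auc_score` computes this area directly; the scores below are the same simulated, hypothetical confidences used in the ROC sketch above.

```python
# ROCAUC on the simulated, hypothetical confidence scores.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = np.concatenate([np.ones(50), np.zeros(50)])
y_score = np.concatenate([rng.normal(0.7, 0.15, 50), rng.normal(0.4, 0.15, 50)])

print(roc_auc_score(y_true, y_score))  # well above 0.5 for separable classes

# Sanity check: a perfect ranking scores 1.0; random guessing hovers around 0.5.
print(roc_auc_score([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # 1.0
```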
Even if you're familiar with ROC curves as a means to introspect model performance, I hope these visuals are useful in building some intuition for how to interpret ROC plots.