Evaluation metric for Supervised Learning

Evaluation metrics explain the performance of a model. An important aspect of evaluation metrics is their capability to discriminate among model results.

In machine learning, we mainly deal with two types of tasks: classification and regression. Classification is a task where a predictive model is trained to assign data to one of several classes, for example a model that predicts whether a loan applicant will default or not. Regression is a task where the model is built to predict a continuous variable, for example house prices for the upcoming year.

In both tasks we do the basic data processing and then split the data into training and testing sets. We use the training data to train the model, and the testing data to compute the model’s predictions and compare them with the known values. Many different algorithms can be used for classification as well as regression problems, but the idea is to choose the algorithm that works most effectively on our data. This is done by evaluating the model with error metrics. Different evaluation methods are used, such as the confusion matrix, accuracy score, classification report, mean square error, etc.

We have different evaluation metrics for supervised learning. Here is the list of the evaluation metrics:

Evaluation metrics for Regression:

  1. Mean Absolute Error (MAE)
  2. Mean Square Error (MSE)
  3. Root Mean Square Error (RMSE)
  4. Root Mean Square Log Error (RMSLE)
  5. R2 and Adjusted R2

Evaluation metrics for Classification:

  1. Confusion Matrix
  2. Accuracy
  3. Alternatives to Accuracy
  4. Recall (TPR, Sensitivity)
  5. Precision
  6. F-Score
  7. ROC AUC
  8. FPR (Type I Error)
  9. FNR (Type II Error)
  10. Log Loss
  11. Gini Coefficient

Now we will discuss the regression metrics:

1. Mean Absolute Error (MAE)

  • MAE is a measure of the difference between two continuous variables, here the actual and the predicted values. It is the average absolute vertical distance between each actual value and the line that best matches the data, so every error contributes in proportion to its size.
Mean Absolute Error: MAE = mean(|actual - predicted|)
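
As a minimal sketch of the MAE definition, the average absolute error can be computed in plain Python. The function name and the house prices below are made up for illustration; scikit-learn offers `mean_absolute_error` in `sklearn.metrics` for the same purpose.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical house prices (in thousands), invented for this example.
actual = [200, 250, 300, 350]
predicted = [210, 240, 320, 330]

# Absolute errors are 10, 10, 20, 20, so the average is 15.0.
mae = mean_absolute_error(actual, predicted)
print(mae)  # 15.0
```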

2. Mean Square Error (MSE)

Mean Squared Error is mainly used when large deviations in the predictions should be penalized heavily: because each error is squared, a few large errors dominate the metric. Values range from 0 upward with no upper bound, in squared units of the target variable.

Mean Square Error (MSE) measures how far the data are from the model’s predicted values.

Disadvantages of MSE:

  1. Sensitive to outliers
  2. If we make a single very bad prediction, squaring makes the error even larger, which may skew the metric towards overestimating the model’s badness.
  3. On the other hand, if all the errors are smaller than 1, then squaring works in the opposite direction: we may underestimate the model’s badness.
Mean Squared Error: MSE = mean((actual - predicted)²)
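
The two effects in the list above can be checked with a small sketch. The data are hypothetical and the function is a plain-Python stand-in, not a library implementation.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [10, 20, 30, 40]              # hypothetical targets
all_close = [11, 19, 31, 39]           # every error is 1
one_bad = [11, 19, 31, 60]             # last error is 20
all_tiny = [10.5, 19.5, 30.5, 39.5]    # every error is 0.5

mse_close = mean_squared_error(actual, all_close)  # (1+1+1+1)/4
mse_bad = mean_squared_error(actual, one_bad)      # (1+1+1+400)/4
mse_tiny = mean_squared_error(actual, all_tiny)    # 0.25, smaller than the 0.5 errors

print(mse_close, mse_bad, mse_tiny)  # 1.0 100.75 0.25
```

A single error of 20 moves the metric from 1.0 to 100.75, while errors of 0.5 shrink to 0.25 after squaring.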

3. Root Mean Square Error (RMSE)

  • The most commonly used metric for regression tasks is RMSE (root-mean-square error). This is defined as the square root of the average squared distance between the actual score and the predicted score.
  • RMSE is sensitive to outliers and can exaggerate results if there are outliers in the data set.
Root Mean Square Error: RMSE = sqrt(mean((actual - predicted)²))
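
A short sketch, with invented numbers, shows that RMSE is expressed in the same units as the target and that one large error pulls it above the average absolute error.

```python
import math


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


# Hypothetical values: three perfect predictions and one that is off by 4.
actual = [3, 5, 7, 9]
predicted = [3, 5, 7, 13]

rmse = root_mean_squared_error(actual, predicted)                       # sqrt(16 / 4)
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)  # 4 / 4

print(rmse, mae)  # 2.0 1.0
```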

4. Root Mean Square Log Error (RMSLE)

  • RMSLE is used when we don’t want to penalize big differences when both the predicted and the actual values are big numbers.
  • It is also used when we want to penalize underestimates more than overestimates.
  • Range: [0,∞)
  • Squared Logarithmic Error(SLE) = (log(prediction+1)-log(actual+1))²
  • RMSLE = sqrt(mean(squared logarithmic errors))
In the formula RMSLE = sqrt((1/n) * Σ (log(pi+1) - log(ai+1))²):

  • n is the total number of observations
  • p is the predicted value
  • a is the actual value
  • 1 is added as a constant to the actual and predicted values because they can be 0 and the log of 0 is undefined.

RMSLE measures the ratio between actual and predicted values, because the difference log(pi+1) - log(ai+1) can be written as log((pi+1)/(ai+1)).
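
The ratio property can be illustrated with a minimal sketch. The function follows the SLE and RMSLE definitions given above; the example values are hypothetical and chosen so that (p+1)/(a+1) is 2 in both cases.

```python
import math


def rmsle(actual, predicted):
    """Root mean squared logarithmic error, with 1 added before taking logs."""
    squared_log_errors = [
        (math.log(p + 1) - math.log(a + 1)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Absolute errors are 10 and 100, but the ratio (p+1)/(a+1) is 2 both times.
small_scale = rmsle([9], [19])     # log(20/10)
large_scale = rmsle([99], [199])   # log(200/100)

print(math.isclose(small_scale, large_scale))  # True
```

Both errors equal log(2), even though the absolute difference is ten times larger in the second case.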




  • Imagine that we have a simple predictive model, for example, a linear regression that predicts the following values.
  • The metrics for these values would be:
  • RMSE: 2.5495
  • RMSLE: 0.5358


  • One difference is the influence that outlier values have on the error. When the values are transformed to a logarithmic scale, large values are compressed, and so is the error. This is known as robustness.
  • We will calculate the metrics again after adding one outlier observation to the table above.
  • If we look at the metrics again, we can see that the RMSE is strongly affected: it has increased a lot due to the new observation, while the RMSLE has barely changed.
  • RMSE: 28.9421
  • RMSLE: 0.5890
  • This effect can also be understood visually on a graph: the logarithmic error curve is not symmetric, since one of its sides is flatter than the other, so it penalizes underestimation more than overestimation.
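
Both points can be reproduced with a rough sketch. The numbers below are hypothetical and are not the values behind the metrics quoted above; they only illustrate the direction of the effect.

```python
import math


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def rmsle(actual, predicted):
    return math.sqrt(
        sum((math.log(p + 1) - math.log(a + 1)) ** 2 for a, p in zip(actual, predicted))
        / len(actual)
    )


# Hypothetical observations, then the same data plus one outlier.
actual = [60, 80, 90]
predicted = [67, 78, 91]
actual_out = actual + [750]
predicted_out = predicted + [1000]

# Growth factor of each metric once the outlier is added.
rmse_growth = rmse(actual_out, predicted_out) / rmse(actual, predicted)
rmsle_growth = rmsle(actual_out, predicted_out) / rmsle(actual, predicted)
print(rmse_growth > rmsle_growth)  # True: RMSE reacts far more strongly

# Asymmetry: the same absolute miss of 40, once below and once above the actual value.
under = rmsle([100], [60])
over = rmsle([100], [140])
print(under > over)  # True: the underestimate is penalized more
```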

5. R Squared (R2)

  • R2 measures how far the data are from the model’s predicted values compared to how far the data are from the mean model (a model predicting the mean value for all given samples).
  • It always has values between -∞ and 1.
  • When the interest is in the relationship between variables, not in prediction, the R2 is less important.
R Squared: R2 = 1 - sum((actual - predicted)²) / sum((actual - mean(actual))²)
from sklearn.metrics import r2_score
r2 = r2_score(y_test,y_pred)
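
To see where the range from -∞ to 1 comes from, here is a plain-Python sketch of the comparison against the mean model. The helper name `r2_manual` and the data are assumptions made for illustration.

```python
def r2_manual(actual, predicted):
    """1 minus (squared error of the model / squared error of the mean model)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1, 2, 3, 4, 5]  # hypothetical targets, mean is 3

r2_perfect = r2_manual(actual, [1, 2, 3, 4, 5])   # no error at all
r2_mean = r2_manual(actual, [3, 3, 3, 3, 3])      # same as the mean model
r2_worse = r2_manual(actual, [5, 4, 3, 2, 1])     # worse than the mean model

print(r2_perfect, r2_mean, r2_worse)  # 1.0 0.0 -3.0
```

A model that does worse than always predicting the mean gets a negative score, which is why there is no lower bound.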

6. Adjusted R Squared

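Adjusted R2 corrects R2 for the number of predictors, so adding a feature that does not help lowers the score. Here n is the number of observations and k is the number of independent variables (features).

adj_r2_score = 1 - ((1-r2)*(n-1)/(n-k-1))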
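
The formula can be wrapped in a small function. This is a sketch under the usual reading of the symbols, with n as the number of observations and k as the number of features; the inputs are invented.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for n observations and k features."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than features plus one")
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# Hypothetical model: R2 of 0.9 on 50 observations.
adj_5_features = adjusted_r2(0.9, n=50, k=5)
adj_20_features = adjusted_r2(0.9, n=50, k=20)

# The same R2 is worth less when it took more features to reach it.
print(adj_5_features > adj_20_features)  # True
```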


  • If you have outliers in the data and you want to ignore them, MAE is a better option, but if you want to account for them in your loss function, go for MSE/RMSE.

Now we will discuss the classification metrics:

1. Confusion Matrix

  • In order to check how many results were predicted correctly, we usually create a matrix called the confusion matrix. For 2 class labels, where the output is either 1 or 0, it is a 2x2 matrix.
  • A confusion matrix is a tabular summary of the number of correct and incorrect predictions made by a classifier. It can be used to evaluate the performance of a classification model through the calculation of performance metrics like accuracy, precision, recall, and F1-score.
Confusion matrix values for a heart disease prediction model:
  • False Positive 21
  • False Negative 2
  • True Negative 4
  • True Positive 34
Confusion Matrix Heatmap
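
As a sketch of how such counts arise, the four cells can be tallied directly from label lists. The lists below are rebuilt artificially so that they reproduce the counts listed above; they are not the original heart disease data.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP and TN for binary labels (1 = positive, 0 = negative)."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1
        elif a == 1 and p == 0:
            counts["FN"] += 1
        elif a == 0 and p == 1:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts


# Artificial labels: 34 TP, 2 FN, 21 FP, 4 TN.
actual = [1] * 34 + [1] * 2 + [0] * 21 + [0] * 4
predicted = [1] * 34 + [0] * 2 + [1] * 21 + [0] * 4

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 34, 'FN': 2, 'FP': 21, 'TN': 4}
```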

2. Accuracy

Accuracy simply measures how often the classifier makes the correct prediction. It’s the ratio between the number of correct predictions and the total number of predictions. While accuracy is easy to understand, it is not suited for unbalanced classes. Hence, we also need to explore other metrics for classification.

There are 4 important terms:

a. TP(True Positive)

Actual value is 1
Predicted value is 1

b. FN(False Negative)

Actual value is 1
Predicted value is 0

c. FP(False Positive)

Actual value is 0
Predicted value is 1

d. TN(True Negative)

Actual value is 0
Predicted value is 0.
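
With these four terms, accuracy is the share of TP and TN among all predictions. The sketch below uses a hypothetical, heavily imbalanced data set to show why accuracy alone can mislead.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical data: 95 negatives, 5 positives, and a classifier that always predicts 0.
always_negative = accuracy(tp=0, tn=95, fp=0, fn=5)

# Share of the 5 actual positives that were found.
positives_found = 0 / (0 + 5)

print(always_negative, positives_found)  # 0.95 0.0
```

The classifier scores 95% accuracy without detecting a single positive case.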

3. Alternatives to Accuracy

4,5. Precision and Recall:

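  1. Balanced Dataset: accuracy is a suitable metric.
    Accuracy = (TP+TN) / (TP+FP+FN+TN)
  2. Imbalanced Dataset: use recall and precision instead.

RECALL (TPR : True Positive Rate). Out of all the actual positives, how many were predicted as positive.
    Recall = TP / (TP+FN)

In cancer detection : Need to focus on recall. Suppose we predict whether a person has cancer or not. If he is suffering from cancer but the model predicts that he is not, that is a FN (False Negative), which is the costly error here.

PRECISION (positive predictive value). Out of all the results predicted as positive, how many were actually positive.
    Precision = TP / (TP+FP)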

In Spam Detection : Need to focus on precision

a. Suppose a mail is not spam but the model predicts it as spam : FP (False Positive). We always try to reduce FP.

b. Whenever False Positive is much more important, use PRECISION

c. Whenever False Negative is much more important, use RECALL
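
A minimal sketch of both metrics, applied to the heart disease counts from the confusion matrix section (TP 34, FP 21, FN 2). The function names are chosen for illustration.

```python
def precision(tp, fp):
    """Of everything predicted positive, the share that is truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of everything truly positive, the share that was predicted positive."""
    return tp / (tp + fn)


p = precision(tp=34, fp=21)  # 34 / 55
r = recall(tp=34, fn=2)      # 34 / 36

print(round(p, 3), round(r, 3))  # 0.618 0.944
```

This model misses few sick patients (high recall) but raises many false alarms (lower precision).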

6. F Score

For a use case where we are trying to get the best precision and recall at the same time, we use the F Score. F Score is the harmonic mean of precision and recall values for a classification problem.
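
The harmonic mean can be sketched as follows. This is the F1 variant, which weights precision and recall equally; the input values are hypothetical.

```python
def f1_score_manual(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall from counts TP=34, FP=21, FN=2.
f1 = f1_score_manual(34 / 55, 34 / 36)

# A lopsided case: the harmonic mean stays close to the weaker value.
lopsided = f1_score_manual(1.0, 0.1)
arithmetic = (1.0 + 0.1) / 2

print(lopsided < arithmetic)  # True
```

Unlike the arithmetic mean, the harmonic mean is high only when both precision and recall are high.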

7. ROC / AUC

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability.

The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better the model is at distinguishing between patients with and without the disease.

a. AUC : Area Under Curve

One of the most widely used metrics for binary classification is the Area Under Curve (AUC). AUC represents the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example. The AUC is based on a plot of the false positive rate vs the true positive rate, which are defined as:

Defining terms used in AUC and ROC Curve

1. TPR (True Positive Rate) / Recall / Sensitivity

Sensitivity tells us what proportion of the positive class got correctly classified. A simple example would be to determine what proportion of the actual sick people were correctly detected by the model.

2. Specificity / TNR (True Negative Rate)

Specificity tells us what proportion of the negative class got correctly classified. Taking the same example as in Sensitivity, Specificity would mean determining the proportion of healthy people who were correctly identified by the model.

3. FPR (False Positive Rate)

FPR tells us what proportion of the negative class got incorrectly classified by the classifier. A higher TNR and a lower FPR are desirable since we want to correctly classify the negative class.

4. FNR (False Negative Rate)

False Negative Rate (FNR) tells us what proportion of the positive class got incorrectly classified by the classifier. A higher TPR and a lower FNR are desirable since we want to correctly classify the positive class.

The AUC is the area under the ROC curve, that is, the curve obtained when the true positive rate is plotted against the false positive rate as below.

AUC ranges between 0 and 1.

A value of 0 means the model ranks every negative example above every positive one, so 100% of its predictions are incorrect. A value of 1 means that 100% of the predictions are correct, and a value of 0.5 means the model does no better than random guessing.
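
The ranking interpretation of AUC can be sketched by comparing every positive example with every negative one. This brute-force version is meant for illustration on small inputs only; the labels and scores are invented.

```python
def auc_by_ranking(labels, scores):
    """Share of (positive, negative) pairs where the positive gets the higher score.

    Ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities

auc = auc_by_ranking(labels, scores)                           # 3 of 4 pairs ranked correctly
auc_perfect = auc_by_ranking(labels, [0.1, 0.2, 0.8, 0.9])
auc_inverted = auc_by_ranking(labels, [0.9, 0.8, 0.2, 0.1])

print(auc, auc_perfect, auc_inverted)  # 0.75 1.0 0.0
```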

ROC Curve

b. ROC : Receiver Operating Characteristic Curve

The ROC curve is plotted with TPR against the FPR where TPR is on y-axis and FPR is on the x-axis.
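
A rough sketch of how the curve is built: each distinct score is used as a threshold, the resulting (FPR, TPR) point is recorded, and the area is summed with the trapezoid rule. The helper names and data are assumptions for illustration; `roc_curve` in `sklearn.metrics` is the usual library route.

```python
def roc_points(labels, scores):
    """(FPR, TPR) points obtained by sweeping the threshold from high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under(points):
    """Trapezoid rule over consecutive (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities

points = roc_points(labels, scores)
area = area_under(points)

print(points)  # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(area)    # 0.75
```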


Relation between Sensitivity, Specificity, FPR and Threshold

Sensitivity and Specificity trade off against each other. So when we increase Sensitivity, Specificity generally decreases and vice versa.

When we decrease the threshold, we get more positive predictions, which increases the sensitivity and decreases the specificity.

Similarly, when we increase the threshold, we get more negative predictions, so we get higher specificity and lower sensitivity.

As we know, FPR is 1 - specificity. So when we increase TPR, FPR also increases and vice versa.
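
The trade-off can be observed by evaluating the same scores at several thresholds. The scores below are hypothetical, and a prediction counts as positive when its score is at or above the threshold.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) at the given decision threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


labels = [0, 0, 0, 1, 1, 1]
scores = [0.2, 0.4, 0.6, 0.5, 0.7, 0.9]

results = []
for threshold in (0.3, 0.55, 0.8):
    sens, spec = sensitivity_specificity(labels, scores, threshold)
    results.append((threshold, sens, spec))
    # FPR is reported as 1 - specificity.
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}, FPR={1 - spec:.2f}")
```

As the threshold rises from 0.3 to 0.8, sensitivity falls from 1 to 1/3 while specificity rises from 1/3 to 1.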

10. Log Loss

Log loss is a pretty good evaluation metric for binary classifiers and it is sometimes the optimization objective as well in case of Logistic regression and Neural Networks.

Binary log loss for a single example is given by the formula below, where y is the true label (0 or 1) and p is the predicted probability of class 1.

Log loss = -(y*log(p) + (1-y)*log(1-p))

Plotting the log loss against the predicted probability shows that the log loss decreases as we become more certain in our prediction of 1 when the true label is 1.
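
A minimal sketch of binary log loss for one example, using the standard definition -(y*log(p) + (1-y)*log(1-p)). The clipping constant `eps` is an assumption added here to avoid log(0); it is not part of the formula itself.

```python
import math


def binary_log_loss(y_true, p, eps=1e-15):
    """Log loss for one example; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# True label is 1: the loss shrinks as the predicted probability approaches 1.
confident_right = binary_log_loss(1, 0.9)
unsure = binary_log_loss(1, 0.6)
confident_wrong = binary_log_loss(1, 0.1)

print(confident_right < unsure < confident_wrong)  # True
```

A confident wrong prediction is punished much more than an unsure one, which is what makes log loss useful as an optimization objective.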

11. Gini Coefficient

- Gini coefficient is sometimes used in classification problems. It can be derived directly from the AUC ROC number. Gini is the ratio of the area between the ROC curve and the diagonal line to the area of the triangle above the diagonal.

- The formula for Gini Coefficient,
Gini = 2*AUC-1

- As a rule of thumb, a Gini above 60% indicates a good model. An important point to note is that this Gini coefficient is different from the Gini index we encounter in decision trees.
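
The conversion is a one-liner; the sketch below only makes the mapping between the two scales explicit. The AUC values are hypothetical.

```python
def gini_from_auc(auc):
    """Gini coefficient derived from the ROC AUC: Gini = 2*AUC - 1."""
    return 2 * auc - 1


# A random model (AUC 0.5) has Gini 0, a perfect model (AUC 1.0) has Gini 1.
gini_random = gini_from_auc(0.5)
gini_perfect = gini_from_auc(1.0)

# An AUC of 0.8 corresponds to the 60% mark mentioned above.
gini_good = gini_from_auc(0.8)

print(gini_random, gini_perfect, round(gini_good, 2))  # 0.0 1.0 0.6
```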


We have discussed the evaluation metrics for both the classification and regression problems. We can always try improving the model performance using a good amount of feature engineering and hyperparameter tuning. Read more about error metrics in “Top 10 Model evaluation Machine Learning Enthusiast should know”.

Please find my next articles on:

1. “What is a confusion matrix?”

2. What is the AUC — ROC Curve?

Working as an automotive design engineer. Actively looking to change domain into Data Science. Certified by Simplilearn as “Data Scientist”.