Oversampling and Undersampling

Sairam Penjarla
3 min read · Jun 8, 2022

Today we are going to look at an important technique for dealing with unbalanced datasets. An unbalanced dataset has a target feature where one class has a very high number of samples and the other classes have very few.

This makes it difficult for a model to learn the minority class and make useful predictions. In this article, we use the Porto Seguro’s Safe Driver Prediction dataset (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/data) to understand how to tackle this problem.

First, let us load the data and look at the gap between the class counts.
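If you want to follow along, a minimal sketch of this step could look like the snippet below. It assumes the competition file is saved locally as train.csv and that the columns are named id and target, as in the Kaggle download; adjust the names if your copy differs.

```python
import pandas as pd

# Load the Porto Seguro training data (assumed to be saved as train.csv).
df = pd.read_csv("train.csv")

# Count how many rows fall into each class of the target feature.
print(df["target"].value_counts())

# Separate the predictors from the label for the steps that follow.
X = df.drop(columns=["id", "target"])
y = df["target"]
```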

In the output above you can observe that there are 573,518 samples in one class and only 21,694 in the other.

If we train an ML model on this dataset, the model will be highly biased towards the majority class. By simply predicting the majority class every single time, the model can achieve above 90% accuracy. Thus, accuracy alone would not give us a fair measure of the model’s robustness; we need to look at the confusion matrix as well.
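As a rough illustration of this bias, here is a sketch that trains a plain scikit-learn classifier on an untouched split and prints its accuracy and confusion matrix. Logistic regression is just a stand-in here (any model shows the same effect), and X and y come from the loading step above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Hold out a stratified test split so both classes appear in it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train a plain classifier directly on the imbalanced data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))  # looks impressive...
print(confusion_matrix(y_test, y_pred))             # ...but almost every prediction is class 0
```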

In the above confusion matrix you can observe that the model predicts only the first (majority) class, essentially 100% of the time.

Let us see how we can tackle this problem. There are two basic approaches: undersampling, which reduces the number of samples in the majority class, and oversampling, which increases the number of samples in the minority class.

Both of these solutions come with drawbacks: undersampling throws away information from the majority class, while oversampling by duplication can lead to overfitting, especially if your minority class has very few samples to begin with, say fewer than 1,000.
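Both approaches are available as ready-made resamplers; the sketch below uses the imbalanced-learn package (an assumption on my part, since the article does not name a library, but it is the usual tool for this), applied only to the training split so the test set stays untouched.

```python
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

# Undersampling: randomly drop majority-class rows until the classes match.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
print(y_under.value_counts())

# Oversampling: randomly duplicate minority-class rows until the classes match.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
print(y_over.value_counts())
```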

In both under- and oversampling, we can use a technique called Tomek links. A Tomek link is a pair of samples from opposite classes that are each other’s nearest neighbours. Such pairs sit right on the class boundary and act as noise, so they are usually removed, typically only the majority-class member of each pair.
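imbalanced-learn ships a TomekLinks resampler for exactly this; a minimal sketch, again applied to the hypothetical training split from above:

```python
from imblearn.under_sampling import TomekLinks

# Find Tomek links (opposite-class nearest-neighbour pairs) and, by default,
# drop only the majority-class member of each pair.
X_tl, y_tl = TomekLinks().fit_resample(X_train, y_train)
print(len(y_train) - len(y_tl), "majority-class samples removed as Tomek links")
```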

SMOTE

Let us now look at a technique called SMOTE (Synthetic Minority Over-sampling Technique). SMOTE itself is an oversampling method: rather than duplicating rows, it synthesizes new minority-class samples by interpolating between existing minority-class neighbours, which grows the minority class without copying it verbatim.

Paired with Tomek-link removal on the majority class, we get both effects at once: the minority class grows and noisy majority samples are dropped, leaving a dataset far better suited for ML and DL training.
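This combination is wrapped up as SMOTETomek in imbalanced-learn; here is a sketch that resamples the training split, retrains the same stand-in model as before, and checks the confusion matrix on the untouched test set.

```python
from imblearn.combine import SMOTETomek
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# SMOTE oversampling of the minority class plus Tomek-link cleaning in one step.
X_res, y_res = SMOTETomek(random_state=42).fit_resample(X_train, y_train)
print(y_res.value_counts())  # the classes are now roughly balanced

# Retrain on the resampled data and evaluate on the original test split.
model = LogisticRegression(max_iter=1000)
model.fit(X_res, y_res)
print(confusion_matrix(y_test, model.predict(X_test)))
```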

In the confusion matrix above we can see that the model now predicts the second class as well. Although this is still a poor model as it stands, we can confirm that the data is now ready to be worked with.

