Outlier Detection and Treatment

Vivek Rai
8 min read · Dec 29, 2020


WHAT IS AN OUTLIER?

An outlier is an observation that is numerically distant from the rest of the data or, in simple words, a value that is out of range. Let's take an example to see what happens to a data set with and without outliers.

As you can see, the data set with outliers has a significantly different mean and standard deviation. In the first scenario (without the outlier), the average is 3.14. With the outlier, the average soars to 59.71. This would change the estimate completely.

Let's take a real-world example. In a company of 50 employees, 45 people have a monthly salary of Rs.6,000 and 5 senior employees have a monthly salary of Rs.100,000 each. The average monthly salary works out to Rs.15,400, which gives a misleading picture (the majority of employees earn far less than that). The median salary, however, is Rs.6,000, which makes much more sense. For this reason the median is a more appropriate measure than the mean here. This is the effect of an outlier.
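As a quick check, here is a minimal sketch in Python using the salary figures from the example above:

```python
# Effect of a few extreme salaries on the mean vs. the median
# (numbers taken from the example above).
import numpy as np

salaries = np.array([6_000] * 45 + [100_000] * 5)  # 45 regular + 5 senior employees

print("Mean salary:  ", salaries.mean())      # 15400.0 -> pulled up by the 5 outliers
print("Median salary:", np.median(salaries))  # 6000.0  -> robust to the outliers
```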

CAUSE FOR OUTLIERS

  • Data Entry Errors:- Human errors such as errors caused during data collection, recording, or entry can cause outliers in data.
  • Measurement Error:- It is the most common source of outliers. This is caused when the measurement instrument used turns out to be faulty.
  • Natural Outlier:- When an outlier is not artificial (due to error), it is a natural outlier. Most real-world data belongs to this category.

OUTLIER DETECTION

Outliers can be of two types: univariate and multivariate. Above, we discussed an example of a univariate outlier. These outliers can be found when we look at the distribution of a single variable. Multivariate outliers are outliers in an n-dimensional space.

DIFFERENT OUTLIER DETECTION TECHNIQUES

  1. Hypothesis Testing
  2. Z-score method
  3. Robust Z-score
  4. I.Q.R method
  5. Winsorization method (Percentile Capping)
  6. DBSCAN Clustering
  7. Isolation Forest
  8. Visualizing the data

1. HYPOTHESIS TESTING (GRUBBS TEST)

Grubbs' test checks whether the single most extreme value in a sample is an outlier. The test statistic is G = max|xᵢ − x̄| / s, the largest absolute deviation from the sample mean in units of the sample standard deviation. If the calculated value of G is greater than the critical value, you can reject the null hypothesis (no outliers) and conclude that one of the values is an outlier.
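A minimal sketch of a two-sided Grubbs test in Python, assuming approximately normally distributed data and using SciPy's t-distribution for the critical value (the sample values are purely illustrative):

```python
# Minimal sketch of a two-sided Grubbs test for a single outlier.
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)
    # Test statistic: largest absolute deviation from the mean, in standard deviations
    g = np.max(np.abs(x - mean)) / sd
    # Critical value based on the t-distribution with n-2 degrees of freedom
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return g, g_crit, g > g_crit   # True -> reject H0 (no outlier)

g, g_crit, is_outlier = grubbs_test([2.1, 2.3, 2.4, 2.2, 2.5, 9.8])
print(g, g_crit, is_outlier)       # the value 9.8 is flagged as an outlier
```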

2. Z-SCORE METHOD

Using the z-score method, we can find out how many standard deviations a value is away from the mean.

For a normal distribution, the area under the curve covered by each standard deviation band is:

  • 68% of the data points lie within ±1 standard deviation of the mean.
  • 95% of the data points lie within ±2 standard deviations.
  • 99.7% of the data points lie within ±3 standard deviations.

If the z-score of a data point is more than 3 in absolute value (since ±3 standard deviations covers 99.7% of the area), the value is quite different from the other values and is treated as an outlier.

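A minimal sketch of the z-score rule on illustrative data (the cut-off of 3 follows the rule of thumb above):

```python
# Flag any point more than 3 standard deviations from the mean.
import numpy as np

data = np.array([12, 13, 14, 12, 11, 13, 15, 12, 14, 13,
                 12, 11, 14, 13, 12, 15, 13, 12, 14, 300])
z = (data - data.mean()) / data.std()

print(data[np.abs(z) > 3])   # -> [300]
```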

3. ROBUST Z-SCORE

It is also called the median absolute deviation (MAD) method. It is similar to the z-score method, but since the mean and standard deviation are heavily influenced by outliers, we replace them with the median and the absolute deviation from the median.

Robust z-score (also called the modified z-score):

Mᵢ = 0.6745 × (xᵢ − median) / MAD, where MAD = median(|X − median|)

Suppose X follows a standard normal distribution. The MAD then converges to the median of the half-normal distribution, which is the 75th percentile of the normal distribution, and Φ⁻¹(0.75) ≈ 0.6745.
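A minimal sketch of the robust (modified) z-score on the same illustrative data as above:

```python
# Robust z-score using the median and the MAD instead of mean and std.
import numpy as np

data = np.array([12, 13, 14, 12, 11, 13, 15, 12, 14, 13,
                 12, 11, 14, 13, 12, 15, 13, 12, 14, 300])

median = np.median(data)
mad = np.median(np.abs(data - median))       # MAD = median(|x - median|)
robust_z = 0.6745 * (data - median) / mad    # 0.6745 ≈ Φ⁻¹(0.75)

print(data[np.abs(robust_z) > 3])            # common cut-off of 3 -> [300]
```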

4. IQR METHOD

In this method we detect outliers using the interquartile range (IQR = Q3 − Q1), which measures the spread of the middle 50% of the data. Any value below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is treated as an outlier.

  • Q1 represents the 1st quartile/25th percentile of the data.
  • Q2 represents the 2nd quartile/median/50th percentile of the data.
  • Q3 represents the 3rd quartile/75th percentile of the data.
  • (Q1 − 1.5 × IQR) and (Q3 + 1.5 × IQR) are the lower and upper fences of the data; any value outside this range is treated as an outlier.
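A minimal sketch of the IQR rule on illustrative data:

```python
# Flag any point outside the IQR fences.
import numpy as np

data = np.array([12, 13, 14, 12, 11, 13, 15, 12, 14, 13,
                 12, 11, 14, 13, 12, 15, 13, 12, 14, 300])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(data[(data < lower_fence) | (data > upper_fence)])   # -> [300]
```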

5. WINSORIZATION METHOD (PERCENTILE CAPPING)

This method is similar to the IQR method. Values above the 99th percentile or below the 1st percentile of the data are treated as outliers and capped at those limits.
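A minimal sketch of percentile capping with NumPy on synthetic data (SciPy's scipy.stats.mstats.winsorize does the same job):

```python
# Cap values below the 1st and above the 99th percentile.
import numpy as np

data = np.random.default_rng(0).normal(50, 10, 1_000)

low, high = np.percentile(data, [1, 99])
capped = np.clip(data, low, high)   # extreme values are pulled in to the cut-offs

print(data.min(), data.max())
print(capped.min(), capped.max())
```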

6. DBSCAN (DENSITY-BASED SPATIAL CLUSTERING OF APPLICATIONS WITH NOISE)

DBSCAN is a density-based clustering algorithm that groups points in high-density regions into clusters and flags points in low-density regions as outliers. Points labelled -1 are the outliers (noise); the remaining clusters contain no outliers. The approach is related to k-means clustering, but it is density based and requires two parameters. DBSCAN works well for multivariate outlier detection.

  1. epsilon: a distance parameter that defines the radius to search for nearby neighbors.
  2. minPts: the minimum number of points required to form a dense region (a cluster).

Using epsilon and minPts, we can classify each data point as:

  • Core point: a point that has at least the minimum number of other points (minPts) within its radius.
  • Border point: a point that is within the radius of a core point but has fewer than minPts points within its own radius.
  • Noise point: a point that is neither a core point nor a border point.
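A minimal sketch with scikit-learn on synthetic data; the eps and min_samples values here are illustrative and would need tuning for real data:

```python
# DBSCAN-based outlier detection: points labelled -1 are noise/outliers.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 0.5, size=(100, 2)),   # dense cluster
    rng.normal(5, 0.5, size=(100, 2)),   # another dense cluster
    [[10, 10], [-8, 9]],                 # two isolated points
])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
outliers = X[labels == -1]               # label -1 marks noise/outliers

print(len(outliers))
```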

7. ISOLATION FOREST

It is an anomaly detection algorithm that belongs to the ensemble decision tree family and is similar in principle to Random Forest.

  1. It classifies each data point as an outlier or not an outlier, and it works well with very high-dimensional data.
  2. It is based on decision trees that isolate the outliers.
  3. If the result is -1, the data point is an outlier; if the result is 1, it is not an outlier.
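A minimal sketch with scikit-learn on synthetic data; the contamination value is an assumed fraction of outliers:

```python
# Isolation Forest: predict() returns -1 for outliers and 1 for inliers.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8, 8], [-7, 9]]])

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
pred = iso.predict(X)        # -1 -> outlier, 1 -> not an outlier

print(X[pred == -1])
```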

8. VISUALIZING THE DATA

Data visualization is useful for data cleaning, exploring data, detecting outliers and unusual groups, identifying trends and clusters, etc. Here is a list of data visualization plots that help to spot outliers:

  1. Box and whisker plot (box plot).
  2. Scatter plot.
  3. Histogram.
  4. Distribution Plot.
  5. QQ plot.
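For example, a quick box plot (a minimal sketch with matplotlib on illustrative data) makes the extreme value stand out immediately:

```python
# A box plot draws whiskers at the IQR fences; points beyond them are outliers.
import numpy as np
import matplotlib.pyplot as plt

data = np.array([12, 13, 14, 12, 11, 13, 15, 12, 14, 13,
                 12, 11, 14, 13, 12, 15, 13, 12, 14, 300])

plt.boxplot(data)
plt.title("Box plot: the point far above the whisker is an outlier")
plt.show()
```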

WHAT NEXT??

After detecting outliers we should remove or treat them, because:

  • Outliers badly affect the mean and standard deviation of the dataset and can lead to statistically erroneous results.
  • They increase the error variance and reduce the power of statistical tests.
  • If the outliers are non-randomly distributed, they can decrease normality.
  • Most machine learning algorithms do not work well in the presence of outliers, so it is desirable to detect and remove them.
  • They can also violate the basic assumptions of regression, ANOVA and other statistical models.

For all these reasons we must be careful about outliers and treat them before building a statistical/machine learning model. There are several techniques for dealing with outliers:

  1. Deleting observations.
  2. Transforming values.
  3. Imputation.

DELETING OBSERVATIONS:

We delete outlier values if they are due to data entry errors or data processing errors, or if the outlier observations are very few in number. We can also trim both ends of the distribution to remove outliers. But deleting observations is not a good idea when the dataset is small.
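A minimal sketch of dropping IQR-flagged rows with pandas (the column name "value" is illustrative):

```python
# Remove rows whose value falls outside the IQR fences.
import pandas as pd

df = pd.DataFrame({"value": [12, 13, 14, 12, 11, 13, 15, 12, 14, 300]})

q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

df_clean = df[mask]                     # outlier rows are dropped
print(len(df), "->", len(df_clean))     # 10 -> 9
```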

TRANSFORMING VALUES:

Transforming variables can also eliminate outliers; the transformed values reduce the variation caused by extreme values.

  1. Scaling
  2. Log transformation
  3. Cube Root Normalization
  4. Box-Cox transformation
  • These techniques convert the values in the dataset to smaller values.
  • If the data has too many extreme values or is skewed, these methods help to make the data more normal.
  • But these techniques do not always give the best results.
  • There is no loss of data with these methods.
  • Of all these methods, the Box-Cox transformation often gives the best result.
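A minimal sketch of the log and Box-Cox transformations with SciPy on illustrative (strictly positive) data:

```python
# Log and Box-Cox transforms shrink the influence of extreme values.
import numpy as np
from scipy import stats

data = np.array([12, 13, 14, 12, 11, 13, 15, 12, 14, 300], dtype=float)

log_data = np.log(data)                          # simple log transformation
boxcox_data, fitted_lambda = stats.boxcox(data)  # lambda is estimated from the data

print(np.round(log_data, 2))      # 300 becomes ~5.7, much closer to the other values
print(round(fitted_lambda, 3))
```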

IMPUTATION

As with missing values, we can also impute outliers, using the mean, the median, or zero. Since we are imputing rather than deleting, there is no loss of data. The median is the appropriate choice here because it is not affected by outliers.
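A minimal sketch of median imputation of IQR-flagged outliers with pandas (the column name "value" is illustrative):

```python
# Replace IQR-flagged outliers with the median of the column.
import pandas as pd

df = pd.DataFrame({"value": [12.0, 13, 14, 12, 11, 13, 15, 12, 14, 300]})

q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = ~df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

df.loc[is_outlier, "value"] = df["value"].median()   # 300 is replaced by the median (13)
print(df["value"].tolist())
```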

CONCLUSION

  1. The median is the best measure of central tendency when the data has outliers or is skewed.
  2. The Winsorization method (percentile capping) is a better outlier detection technique than the others.
  3. Median imputation completely removes the outliers.

Outliers are one of the major problems in machine learning. If you neglect them, the result is a poorly performing model. In this article I have tried to cover almost all the topics related to outliers, outlier detection, and outlier treatment techniques.


Vivek Rai

A passionate data scientist with knowledge of predictive modelling, data processing, and data mining algorithms for solving challenging business problems.