Skewness And Kurtosis In Machine Learning

Vivek Rai
8 min read · Jan 7, 2021

WHY DO WE CARE SO MUCH ABOUT NORMALITY?

Most parametric machine learning models, like LDA, linear regression, and many more, assume that the data is normally distributed. If this assumption is violated, the model may fail to give accurate predictions.

WHAT IS A NORMAL DISTRIBUTION?

A normal (Gaussian) distribution is symmetric about its mean and follows a bell-shaped curve; the special case with mean 0 and standard deviation 1 is known as the standard normal distribution. About 99.7% of the values lie within 3 standard deviations of the mean, and the mean, median, and mode of a normal distribution are equal.
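As a quick illustration, here is a minimal NumPy sketch that checks the 99.7% rule and the mean ≈ median property on a simulated standard normal sample:

```python
import numpy as np

# Draw a large sample from the standard normal distribution.
rng = np.random.default_rng(42)
sample = rng.normal(loc=0, scale=1, size=100_000)

# Mean and median should be very close to each other (and to 0).
print(f"Mean: {sample.mean():.3f}, Median: {np.median(sample):.3f}")

# About 99.7% of values should fall within 3 standard deviations.
print(f"Within 3 SD: {np.mean(np.abs(sample) <= 3):.4f}")
```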

SKEWNESS

What is skewness?

Skewness of a distribution is defined as a lack of symmetry. In a symmetrical distribution, the mean, median, and mode are equal. The normal distribution has a skewness of 0.

Skewness tells us about the shape of the distribution of our data.

Skewness is of two types:

  • Positive skewness: When the tail on the right side of the distribution is longer or fatter, we say the data is positively skewed. For positive skewness, mean > median > mode.
  • Negative skewness: When the tail on the left side of the distribution is longer or fatter, we say the distribution is negatively skewed. For negative skewness, mean < median < mode.

What does skewness tell us?

To understand this better, consider an example.

Consider house prices ranging from 100,000 to 1,000,000, with the average being 500,000.

If the peak of the distribution is on the left side, our data is positively skewed and most of the houses were sold at a price less than the average.

If the peak of the distribution is on the right side, our data is negatively skewed and most of the houses were sold at a price greater than the average.

Now, the question is: when can we say our data is moderately skewed or heavily skewed?

The rule of thumb is: if the skewness is between -0.5 and +0.5, the data is fairly symmetrical. If the skewness is between -1 and -0.5 or between 0.5 and 1, the data is moderately skewed. And if the skewness is less than -1 or greater than +1, the data is heavily skewed.
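As an illustration, here is a minimal sketch that applies this rule of thumb with scipy.stats.skew to a simulated right-skewed sample:

```python
import numpy as np
from scipy.stats import skew

# An exponential sample is strongly right-skewed (theoretical skewness = 2).
rng = np.random.default_rng(0)
data = rng.exponential(scale=1.0, size=10_000)

# Classify the sample using the rule of thumb above.
s = skew(data)
if abs(s) < 0.5:
    label = "fairly symmetrical"
elif abs(s) <= 1:
    label = "moderately skewed"
else:
    label = "heavily skewed"
print(f"skewness = {s:.2f} -> {label}")
```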

KURTOSIS

You might have heard that kurtosis tells us about the shape, peakedness, or flatness of the distribution, but this is not quite right. Kurtosis tells us about the behaviour of the tails: it is really a measure of the outliers present in the distribution.

Kurtosis is of three types:

  • Mesokurtic: When the tails of the distribution are similar to those of the normal distribution, it is mesokurtic. The kurtosis of the normal distribution is 3.
  • Leptokurtic: If the kurtosis is greater than 3, the distribution is leptokurtic. In this case the tails are heavier than those of the normal distribution, which means lots of outliers are present in the data. It can be recognized as a thin, bell-shaped distribution with a peak higher than the normal distribution.
  • Platykurtic: The kurtosis is less than 3, which implies thinner tails, or fewer outliers, than the normal distribution. In the platykurtic case the bell-shaped distribution is broader and the peak is lower than the mesokurtic one.
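The sketch below illustrates the three cases on simulated samples; the normal, Laplace, and uniform distributions are standard examples of mesokurtic, leptokurtic, and platykurtic shapes:

```python
import numpy as np
from scipy.stats import kurtosis

# fisher=False reports plain kurtosis, so the normal benchmark is 3.
rng = np.random.default_rng(0)
samples = {
    "normal (mesokurtic)": rng.normal(size=100_000),
    "laplace (leptokurtic)": rng.laplace(size=100_000),
    "uniform (platykurtic)": rng.uniform(-1, 1, size=100_000),
}
for name, data in samples.items():
    print(f"{name}: kurtosis = {kurtosis(data, fisher=False):.2f}")
```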

TRANSFORMATIONS TO REDUCE SKEWNESS OR KURTOSIS

Let’s take an example of handling skewness and kurtosis in a dataset. For this, I have used the ‘House Pricing’ data.

Let’s check the distribution of “SalePrice”:
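A minimal sketch of this step, assuming the data is available locally as train.csv (the file name is an assumption):

```python
import pandas as pd

# Load the House Pricing data (path is an assumption).
df = pd.read_csv("train.csv")

# Summary statistics of the target variable.
print(df["SalePrice"].describe())
```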

Here we can see that the mean (180,921) is greater than the median (163,000), and the maximum is about 3.5 times the 75th percentile. (The distribution is positively skewed.)

  • We can say that most of the house prices are below the average.

Let’s plot and check:
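A minimal plotting sketch, reusing df from the snippet above:

```python
import matplotlib.pyplot as plt

# Histogram of the raw SalePrice values.
df["SalePrice"].hist(bins=50)
plt.xlabel("SalePrice")
plt.ylabel("Frequency")
plt.title("Distribution of SalePrice")
plt.show()
```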

The histogram confirms that our dataset is positively skewed.

Now let’s check the measures of skewness and kurtosis:
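A sketch of this check using scipy.stats:

```python
from scipy.stats import skew, kurtosis

# Skewness and (non-Fisher) kurtosis of the raw SalePrice.
print("Skewness:", skew(df["SalePrice"]))
print("Kurtosis:", kurtosis(df["SalePrice"], fisher=False))  # normal = 3
```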

Here, the skewness of the raw data is positive and greater than 1, and the kurtosis is greater than 3: the right tail of the data is long. So our data in this case is positively skewed and leptokurtic.

Note: if we keep ‘fisher=True’, the kurtosis of a normal distribution is reported as 0. In that case, kurtosis > 0 is leptokurtic and kurtosis < 0 is platykurtic.

Common transformation methods to handle skewed data are:

  • Log transformation
  • Square root transformation
  • Cube root transformation
  • Box-Cox transformation

Let’s look at the effect of all these transformations on our dataset.

1. Log Transformation

Logarithm is defined only for positive values, so we can’t apply the log transformation to zero or negative numbers.

Logarithmic transformation is a convenient means of transforming a highly skewed variable into a more nearly normal one. When modeling variables with non-linear relationships, the errors may also be skewed. Using the logarithm of one or more variables improves the fit of the model by transforming the distribution of the features into a more normally shaped bell curve.

Why log?

The normal distribution is widely used in basic research studies to model continuous outcomes. Unfortunately, the symmetric bell-shaped distribution often does not adequately describe the observed data from research projects. Quite often data arising in real studies are so skewed that standard statistical analyses of these data yield invalid results.

Many methods have been developed to test the normality assumption of observed data. When the distribution of the continuous data is non-normal, transformations of data are applied to make the data as “normal” as possible and, thus, increase the validity of the associated statistical analyses.

A popular use of the log transformation is to reduce the variability of data, especially in data sets that include outlying observations. Yet, contrary to this popular belief, the log transformation can often increase, not reduce, the variability of data, whether or not outliers are present.

Why not?

Using transformations in general, and the log transformation in particular, can be quite problematic. If such an approach is used, the researcher must be mindful of its limitations, particularly when interpreting the relevance of the analysis of transformed data for the hypothesis of interest about the original data.
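With those caveats in mind, here is a sketch of the transformation itself, reusing df from above (SalePrice is strictly positive, so the log is safe here):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skew, kurtosis

# Log-transform the target and re-measure the shape.
log_price = np.log(df["SalePrice"])
print("Skewness after log:", skew(log_price))
print("Kurtosis after log:", kurtosis(log_price, fisher=False))

# Plot the transformed distribution.
log_price.hist(bins=50)
plt.title("Distribution of log(SalePrice)")
plt.show()
```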

Now if you look at the distribution, it is close to a normal distribution. We have also reduced the skewness and the kurtosis.

Let’s apply a linear regression model and check how well the model performs before and after we apply the log transformation.
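First, a sketch of the raw scatter plot of SalePrice against GrLivArea:

```python
import matplotlib.pyplot as plt

# Scatter plot on the raw scale.
plt.scatter(df["GrLivArea"], df["SalePrice"], alpha=0.3)
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()
```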

We can notice that the scatter has a cone shape: the data points fan out as GrLivArea increases.

Check the model summary:
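A sketch of this step using statsmodels (the original modeling code is not shown, so this is one reasonable reconstruction):

```python
import statsmodels.api as sm

# Fit SalePrice ~ GrLivArea on the raw scale and inspect the fit.
X = sm.add_constant(df["GrLivArea"])
model = sm.OLS(df["SalePrice"], X).fit()
print(model.summary())  # the text reports R-squared around 0.50 here
```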

Apply Log on GrLivArea

Apply Linear Regression on Log transformed SalePrice and GrLivArea
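A sketch covering both steps, reusing df from above:

```python
import numpy as np
import statsmodels.api as sm

# Log-transform both variables, then refit the regression.
log_area = np.log(df["GrLivArea"])
log_price = np.log(df["SalePrice"])

X_log = sm.add_constant(log_area)
model_log = sm.OLS(log_price, X_log).fit()
print(model_log.summary())  # the text reports R-squared rising to about 0.53
```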

We can now see the relationship as a percent change. By applying the logarithm to both variables, the regression line passes much more cleanly through the body of the data points, resulting in a better prediction model.

After applying the log transformation, R-squared has increased from 0.50 to 0.53.

2. Square Root Transformation

The square root transformation maps x to x^(1/2) = sqrt(x). It has a moderate effect on distribution shape: it is weaker than the logarithm and the cube root.

It is also used for reducing right skewness, and also has the advantage that it can be applied to zero values.

Note that the square root of an area has the units of a length. It is commonly applied to counted data, especially if the values are mostly rather small.
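A sketch of this transformation on our data:

```python
import numpy as np
from scipy.stats import skew

# Square root transformation of the target.
sqrt_price = np.sqrt(df["SalePrice"])
print("Skewness after square root:", skew(sqrt_price))  # ~0.94 per the note below
```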

Note: the square root transformation has reduced the skewness from 1.88 to 0.94, which is much nearer to zero.

3. Cube Root Transformation

The cube root transformation maps x to x^(1/3). This is a fairly strong transformation with a substantial effect on distribution shape: it is weaker than the logarithm but stronger than the square root transformation.

It is also used for reducing right skewness, and has the advantage that it can be applied to zero and negative values. Note that the cube root of a volume has the units of a length. It is commonly applied to rainfall data.
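A sketch of this transformation on our data:

```python
import numpy as np
from scipy.stats import skew

# Cube root transformation of the target (np.cbrt also handles negatives).
cbrt_price = np.cbrt(df["SalePrice"])
print("Skewness after cube root:", skew(cbrt_price))  # ~0.66 per the note below
```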

Note: the square root transformation reduced the skewness from 1.88 to 0.94, but the cube root transformation brings it down to 0.66, which is much nearer to zero.

4. Box-Cox Transformation

The Box-Cox transformation is a particularly useful family of transformations. It is defined as:

T(Y) = (Y^λ − 1)/λ, where Y is the response variable and λ is the transformation parameter. For λ = 0, the natural log of the data is taken instead of using the above formula.

At the core of the Box-Cox transformation is an exponent, lambda (λ), which varies from -5 to 5. All values of λ are considered and the optimal value for your data is selected; the “optimal value” is the one that results in the best approximation of a normal distribution curve.

Below is a list of common lambda values we can consider while doing the Box-Cox transformation:

  • λ = -2 → Y′ = 1/Y²
  • λ = -1 → Y′ = 1/Y
  • λ = -0.5 → Y′ = 1/sqrt(Y)
  • λ = 0 → Y′ = log(Y)
  • λ = 0.5 → Y′ = sqrt(Y)
  • λ = 1 → Y′ = Y
  • λ = 2 → Y′ = Y²

The Box-Cox transformation only works if all the data is strictly positive. In the case of negative data, we can add a constant value to make it positive before applying the Box-Cox transformation.
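A sketch using scipy.stats.boxcox, which selects the optimal λ by maximum likelihood when no lambda is given:

```python
from scipy.stats import boxcox, skew

# Box-Cox transformation of the target (SalePrice is strictly positive).
bc_price, best_lambda = boxcox(df["SalePrice"])
print("Optimal lambda:", best_lambda)
print("Skewness after Box-Cox:", skew(bc_price))
```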

Now if we look at the distribution, SalePrice is close to a normal distribution.

Thanks for reading! If you liked this article, you can read my other articles here. If you like this article, please show your appreciation by clapping 👏 and sharing it.
