We all know what outliers are: “values different from others” or “IMPOSSIBLE values”. But how do we deal with them?
Keeping outliers can mess with our analysis. But removing all of them can mess with it too. (I know, how weird.) Let me show you something I ran into on my very first EDA (><)
This is a boxplot of the price distribution of wines from various countries. I applied the IQR rule, dropped the outliers, and got a pretty boxplot.
While doing the EDA, I tried to look at the price distribution by country.
WHAT?? Only the US!
I know, I know, it's messy and blah blah, but it looks legit.
So see how nicely the data gets messed up if outliers are just dropped.
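Here is a minimal sketch of how this can happen. The prices below are made up for illustration (they are not the wine-reviews data), and the 1.5 × IQR fence is the usual convention, which I am assuming is the rule applied above.

```python
import pandas as pd

# Made-up prices for illustration only (not the wine-reviews data).
wine = pd.DataFrame({
    "country": ["US"] * 12 + ["France"] * 3,
    "price": [8, 10, 12, 12, 14, 15, 15, 16, 18, 20, 22, 25, 90, 120, 150],
})

# IQR rule on the pooled prices: keep values within 1.5 * IQR of the quartiles.
q1, q3 = wine["price"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept = wine[wine["price"].between(lower, upper)]

# Compare how many rows each country had before and after the drop.
before = wine["country"].value_counts()
after = kept["country"].value_counts().reindex(before.index, fill_value=0)
print(pd.DataFrame({"before": before, "after": after}))
```

Because the fences are computed on all prices pooled together, a country whose wines are simply more expensive can lose every one of its rows.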
What to do?
1. Use common sense.
Really, just think. Can a bottle of wine cost $3300 (think about France)? Yes, why not? If we do some cross-checking and research, we will get the answer. I encountered a similar thing in the Audible catalogue [3].
2. Use the trial method
When you plot a graph and feel that it’s skewed, you can make a temporary data frame without the suspect values and repeat the analysis on it.
trial = audiob_adv[~(audiob_adv['Price']>3000)]
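To judge whether the cut matters, it can help to put the summary statistics of the full and the trial data side by side. This is only a sketch: the `audiob_adv` frame below is a stand-in with invented prices, not the real catalogue.

```python
import pandas as pd

# Stand-in for the real data frame; the values are invented for illustration.
audiob_adv = pd.DataFrame({"Price": [150, 300, 450, 500, 650, 800, 3300]})

# Temporary copy without the suspicious rows; the original stays untouched.
trial = audiob_adv[~(audiob_adv["Price"] > 3000)]

# Put the summary statistics side by side to see how much the cut changes.
comparison = pd.DataFrame({
    "all_rows": audiob_adv["Price"].describe(),
    "trial": trial["Price"].describe(),
})
print(comparison)
```

If the mean and spread barely move, the extreme values are probably harmless; if they shift a lot, it is worth finding out what those rows really are before deciding anything.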
3. Drop the physically impossible values
If a value is physically impossible, then drop ’em. In the boxplot of ratings above, we can see a minimum of -1, which is not OK.
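A small sketch of this filter, with invented ratings and an assumed valid range of 0 to 5 (the real scale of your data may differ):

```python
import pandas as pd

# Invented ratings; -1 mimics the impossible minimum seen in the boxplot.
books = pd.DataFrame({"Rating": [4.5, 3.0, -1.0, 5.0, 0.0, 4.0]})

# Assumed valid range for a rating; adjust to the real scale of your data.
MIN_RATING, MAX_RATING = 0.0, 5.0

# between() is inclusive on both ends by default.
valid = books["Rating"].between(MIN_RATING, MAX_RATING)
print(f"Dropping {(~valid).sum()} impossible rating(s)")
books_clean = books[valid]
```

Before dropping, it may be worth checking whether a value like -1 is a placeholder for "not rated", in which case treating it as missing could be the better choice.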
4. Use statistical tools
Use statistical tests to understand and remove the outliers, such as Peirce’s criterion, Chauvenet’s criterion, Grubbs’s test for outliers, Dixon’s Q test, and others.
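As one example, here is a rough sketch of Chauvenet’s criterion using only the standard library. It assumes the data is roughly normally distributed, and the prices are invented. A point is flagged when the expected number of points at least that far from the mean, in a sample of this size, is below 0.5.

```python
import math
import statistics

def chauvenet_outliers(values):
    """Return the values that Chauvenet's criterion would flag."""
    n = len(values)
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    flagged = []
    for x in values:
        z = abs(x - mean) / std
        # Two-sided tail probability of a standard normal beyond z.
        tail_prob = math.erfc(z / math.sqrt(2))
        # Flag x if fewer than half a point is expected this far out.
        if n * tail_prob < 0.5:
            flagged.append(x)
    return flagged

# Invented prices with one extreme value.
prices = [10, 12, 14, 15, 15, 16, 18, 20, 22, 300]
print(chauvenet_outliers(prices))  # [300]
```

A flag from a test like this is a reason to look closer at the value, not an automatic reason to delete it.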
5. Automatic outlier detection (in Python)
This article here covers 4 automatic outlier detection algorithms, such as Isolation Forest.
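For a taste of what that looks like, here is a minimal Isolation Forest sketch with scikit-learn. The prices are invented, and the `contamination` value (the assumed share of outliers) is a guess for this toy data, not a recommendation.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Invented prices: most are modest, one is extreme.
prices = np.array([10, 12, 14, 15, 15, 16, 18, 20, 22, 25, 3300], dtype=float)
X = prices.reshape(-1, 1)  # scikit-learn expects a 2-D array

# contamination is the assumed share of outliers; 0.1 is a guess, not a rule.
model = IsolationForest(contamination=0.1, random_state=0)
labels = model.fit_predict(X)  # -1 marks an outlier, 1 an inlier

# Show which prices the model considers outliers.
print(prices[labels == -1])
```

The algorithm only tells you which points are easy to isolate from the rest. Whether they are errors or just expensive wine is still up to you.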
Further Reading:
1. https://humansofdata.atlan.com/2018/03/when-delete-outliers-dataset/
2. https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/
3. https://www.kaggle.com/amritvirsinghx/audible-complete-catalog
My Kaggle notebook:
https://www.kaggle.com/ipshitagh/wine-reviews-eda-and-recommender-system