Exploratory data analysis is one of the most important steps in any data science project. This is because, unless we understand the **data**, we will not be able to do anything useful with it.

Data is not meant to fit the model, the model has to fit the data

During exploratory data analysis, we come across many facts about the data that help us build models. These include:

- Outliers
- Correlations
- Patterns
- Missing Data
- Ways to feature engineer

etc.

In this project, we will run through an EDA of a used-cars dataset. We will do the following:

- Import required libraries
- Convert data types and extract features
- Treat Missing values
- Identify and treat outliers
- Decrease number of unique values (cluster)
- Encode categorical data
- Compare the target variable before and after imputation
- Bivariate analysis

# Importing Libraries

- For data wrangling:

`import pandas as pd`

`import numpy as np`

- For a visual representation of missing values

`import missingno`

- For visualizations

`import seaborn as sns`

`import matplotlib.pyplot as plt`

- For treating outliers

`from scipy.stats.mstats import winsorize`

# Convert data types and extract features

Before going to conversion, let us first get a view of the dataset

Features like **‘Mileage’**, **‘Engine’** and **‘Power’** are string type, and we only need the numerical value, not the unit. Another important thing to note is that the units are not consistent: some are **km/kg** and some **kmpl**. We will handle this while extracting these features too.

First, let us extract **Brand** and **Model** from the feature **Name**.

```python
# getting brand and model of car from name
data['Brand'] = data['Name'].map(lambda x : x.split(' ')[0])
data['Model'] = data['Name'].map(lambda x : x.split(' ')[1])
```

What exactly is happening? Let us go step by step:

- The `map()` function executes a specified function for each item in an iterable. Hence, all the items of **Name** are passed through the lambda function.
- The `lambda` function *splits* the string on " " and returns the first element, which is stored in **Brand**; the second element is stored in **Model**.
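To see this in action, here is a minimal sketch with hypothetical car names (not rows from the actual dataset):

```python
import pandas as pd

# Toy 'Name' column mimicking the dataset (hypothetical values)
names = pd.Series(['Maruti Wagon R LXI', 'Hyundai Creta 1.6'])

brand = names.map(lambda x: x.split(' ')[0])  # first token -> Brand
model = names.map(lambda x: x.split(' ')[1])  # second token -> Model

print(brand.tolist())  # ['Maruti', 'Hyundai']
print(model.tolist())  # ['Wagon', 'Creta']
```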

Next we extract the numerical value from **Engine** and **Power.**

```python
# Converting to string, extracting the first value, then converting to float
# (note: np.float is deprecated in recent NumPy; plain float works the same)
data['Engine(cc)'] = data['Engine'].astype(str).map(lambda x : x.split(' ')[0]).replace('nan', np.nan).astype(float)
data['Power(bph)'] = data['Power'].astype(str).map(lambda x : x.split(' ')[0]).replace('null', np.nan).astype(float)
```

The syntax does the following:

- The column **Engine** is *converted to string*.
- The `lambda` function *splits the string and returns the first element*.
- After mapping the values, the string `'nan'` (or `'null'`) is replaced with `np.nan`.
- The values are converted to **float**.

A legitimate question: why are we converting the NaN strings to `np.nan`? Because `np.nan` is a float value, it allows vectorized operations and makes computations easier and more efficient.
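A quick illustration of why `np.nan` behaves nicely (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# A string column with a 'null' placeholder, cleaned the same way as above
s = pd.Series(['72', 'null', '45']).replace('null', np.nan).astype(float)

print(type(np.nan))    # np.nan is just a float
print(s.mean())        # NaN is skipped automatically: (72 + 45) / 2 = 58.5
print(s.isna().sum())  # 1 missing value detected
```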

Now, we will make the **Mileage** consistent:

```python
# since mileage has two units, to keep it consistent, we will convert km/kg to km/l
def mileage_convert(x):
    if type(x) == str:
        if x.split()[-1] == 'km/kg':
            return float(x.split()[0]) * 1.40  # conversion formula (see below)
        elif x.split()[-1] == 'kmpl':
            return float(x.split()[0])
    else:
        return x

data['Mileage(km/l)'] = data['Mileage'].apply(mileage_convert)
```

**Formula**

The formula first requires the weight of petrol (borrowed from Quora). The density of petrol ranges from 0.710 to 0.775 kg/litre at 15 °C. Using mass = density × volume, one litre of petrol weighs 0.710 × 1 = 0.710 kg (710 g).

Next we change the units:

```
1 litre of petrol = 0.710 kg
1 km/kg = 1 / 0.710 ≈ 1.4084507 km/l
```

Hence the steps are as follows:

- Convert the data to string.
- Extract the unit (2nd element) and the value (1st element); from the unit, we see whether the entry is in *kmpl* or *km/kg*.
- If the data is in *km/kg*, we multiply it by 1.40 to change it to *km/l*.
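The function above can be exercised on its own. Here is a self-contained sketch with made-up mileage strings:

```python
# Self-contained copy of the conversion logic described above
def mileage_convert(x):
    if isinstance(x, str):
        value, unit = x.split()[0], x.split()[-1]
        if unit == 'km/kg':
            return float(value) * 1.40  # km/kg -> km/l
        elif unit == 'kmpl':
            return float(value)         # already km/l
    return x  # non-string (e.g. NaN) passes through unchanged

print(mileage_convert('26.6 km/kg'))   # ~37.24
print(mileage_convert('19.67 kmpl'))   # 19.67
```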

# Treating Missing Values

We first visualize the missing values to get a better view.

We can see that the majority of the missing values are in **New_Price**. To treat them, we can do the following:

- Check the percentage of missing values in each column.
- If the percentage exceeds a certain threshold (in our case, 75%), we drop the column.
- For the other columns, we impute: median for numerical data and most frequent for categorical data.
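The steps above can be sketched compactly with `isnull().mean()` (a hypothetical two-column frame, not the real data):

```python
import pandas as pd
import numpy as np

# Hypothetical frame: one column 80% missing, one 25% missing
df = pd.DataFrame({
    'New_Price': [np.nan, np.nan, np.nan, np.nan, 5.0],
    'Seats':     [5, np.nan, 5, 7, 5],
})

pct_missing = df.isnull().mean() * 100                       # percentage per column
df = df.drop(columns=pct_missing[pct_missing >= 75].index)   # drop columns above threshold
df['Seats'] = df['Seats'].fillna(df['Seats'].median())       # impute the rest with the median

print(list(df.columns))        # New_Price (80% missing) is gone
print(df['Seats'].tolist())    # the NaN seat count is now the median, 5
```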

```python
# see null data
d = data.isnull().any()

# calculating percentage of missing data and storing in a dictionary
k = {}
for i, j in zip(d, data):
    if i == True:
        print("{} - {} %".format(j, data[j].isnull().sum() / (data.shape[0] / 100)))
        k[j] = data[j].isnull().sum() / (data.shape[0] / 100)

# removing columns with more than 75% missing data
for i in k:
    if k[i] >= 75:
        data = data.drop([i], axis=1)

# filling the rest with the median
for i in data:
    if data[i].dtype != object:
        data[i].fillna(int(data[i].median()), inplace=True)
```

After these steps, let's see the missing-value chart again.

# Treating outliers

To understand the outliers, we will plot boxplots of the data and treat them accordingly. We should never remove outliers without analysing them first.
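Before plotting, a common heuristic for flagging candidate outliers is the 1.5×IQR rule (a standard rule of thumb, not something used in the original notebook; the numbers below are hypothetical):

```python
import numpy as np

# Hypothetical kilometre readings, with one implausible entry
km = np.array([41000, 46000, 87000, 36000, 6500000])

q1, q3 = np.percentile(km, [25, 75])
upper = q3 + 1.5 * (q3 - q1)   # 1.5 * IQR above the third quartile
outliers = km[km > upper]

print(outliers)  # the 6,500,000 km entry is flagged
```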

Let us go column by column:

## Year

If we look at the minimum value of **Year**, it is **1998**, which is not "impossible"; that is why we won't make any changes here.

## Kilometers Driven

We can see that we do have outliers here: the maximum value is way above the kilometres a normal car would cover.

Therefore, we need to treat it. I have used winsorization to do so; the article below explains the details.

Detecting and Treating Outliers In Python — Part 3 | by Alicia Horsch | Towards Data Science

`data['Kilometers_Driven'] = winsorize(data['Kilometers_Driven'].values, limits = [0, 0.01]).data`

We are capping the top 1% of values (`limits = [0, 0.01]`) on the *right tail*.
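A minimal sketch of what `winsorize` does to a toy array (hypothetical values; here the top 10% is capped so the effect is visible on ten points):

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Ten hypothetical kilometre readings, one of them extreme
km = np.array([10000, 20000, 30000, 40000, 50000,
               60000, 70000, 80000, 90000, 6500000])

# Cap the top 10% (here, one value) at the next-largest observation
capped = winsorize(km, limits=[0, 0.1])

print(capped.max())  # the 6,500,000 entry is clipped down to 90,000
```

Unlike trimming, winsorization keeps every row; it only pulls extreme values in to the chosen percentile.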

To understand the full working, please see this notebook.

In the same way, we will check and treat other columns.

# Decreasing number of unique values

There are two columns in which we can decrease the number of unique entries:

- Name
- Location

From **Name**, we can cluster the cars by their types, which are:

- Hatchback
- Sedan
- SUV
- MUV
- Crossover
- Coupe
- Convertible

And for **Location,** we can cluster them as regions.

```python
def extract_region(x):
    if x == 'Delhi' or x == 'Jaipur':
        return 'North'
    elif x == 'Bangalore' or x == 'Chennai' or x == 'Coimbatore' or x == 'Hyderabad' or x == 'Kochi':
        return 'South'
    elif x == 'Kolkata':
        return 'East'
    elif x == 'Mumbai' or x == 'Pune' or x == 'Ahmedabad':
        return 'West'

data['Region'] = data['Location'].apply(extract_region)
```

# Encode Categorical Data

To encode the categorical data, there are two ways:

- Using labels
- One-hot encoding

To attach labels to each type, we can do the following:

*#first we will see the unique values in each of the variables*

`data.Fuel_Type.unique()`

The output:

And then, we use *apply* and mapping to give labels.

```python
# encoding
def encode_fuel(x):
    if x == 'CNG':
        return 1
    elif x == 'Diesel':
        return 2
    elif x == 'Petrol':
        return 3
    elif x == 'LPG':
        return 4
    else:
        return 5

data['encoded_Fuel_Type'] = data['Fuel_Type'].apply(encode_fuel)
```

To get one-hot encoding, we use **pd.get_dummies**:

`pd.get_dummies(data['Fuel_Type'])`

In this, a "1" marks the value that the row has; the other columns are "0".
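A small sketch of what `pd.get_dummies` produces for a toy fuel column (hypothetical values):

```python
import pandas as pd

fuel = pd.Series(['CNG', 'Diesel', 'Petrol', 'Diesel'], name='Fuel_Type')
dummies = pd.get_dummies(fuel)

print(dummies.columns.tolist())      # one column per category: ['CNG', 'Diesel', 'Petrol']
print(dummies.sum(axis=1).tolist())  # every row has exactly one 1
```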

# Compare the target variable before and after imputation

The target variable, **New_Price**, had more than 75% of its data missing, hence we dropped it. But for training a model, we will need that column. To impute such a column, it is better to use the **median**, because it is not affected by extreme values.
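A quick numeric illustration (hypothetical prices) of why the median is the safer choice here:

```python
import pandas as pd
import numpy as np

# Hypothetical price column with one extreme value and one missing entry
price = pd.Series([4.5, 5.0, 5.5, 6.0, 95.0, np.nan])

print(price.mean())    # 23.2 -- dragged far upward by the single outlier
print(price.median())  # 5.5  -- robust to it, a better fill value
```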

Let us compare the histogram of before and after:

Before imputation, we can see that the plot is positively skewed, with quite a number of value bins.

After imputation, the shape of the distribution remains the same, but the number of bins has changed considerably; the values above 200 also seem to have diminished.

# Bivariate Analysis

This is the most interesting part of EDA: this is where we look at the graphs and infer things from the data. Let's see a few:

## Location and Price:

- Coimbatore and Bangalore are ahead, which will affect the region-wise prices too.
- Kolkata, Kochi and Pune are almost on the same line.

What can be the reason? All these cities are metros, so why is there such unevenness in price?
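The chart above comes from seaborn, but the location-wise price comparison can also be sketched numerically with a `groupby` (hypothetical numbers, not the real dataset):

```python
import pandas as pd

# Hypothetical slice: a few listings per city, price in lakhs
df = pd.DataFrame({
    'Location': ['Coimbatore', 'Kolkata', 'Coimbatore', 'Kolkata'],
    'Price':    [9.0, 3.0, 7.0, 4.0],
})

# Median price per location, highest first
by_location = df.groupby('Location')['Price'].median().sort_values(ascending=False)
print(by_location)  # Coimbatore leads in this toy example
```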

## KM vs Price

- Faint correlation
- Prices of less-driven cars are higher (Duh!)
- One of the most driven cars is highly priced. Why? Maybe because it is a high-end car.

- Correlation can be seen
- Why are there such crashes and peaks? Is it because of the models?
- Do model and company matter more than anything else?

## Model and Price

- Lamborghini is the highest (way too high), but other high-end companies like Audi and Porsche are comparatively lower.

Isn’t it interesting!

Notebook — eda and visualizations | Kaggle

If you have been with me so far, please give an upvote :)

# References

1. https://philkotse.com/car-buying-and-selling/just-how-many-maximum-kilometers-can-a-car-do-6254

2. https://www.statisticshowto.com/winsorize/

3. https://kegui.medium.com/what-is-the-difference-between-nan-none-pd-nan-and-np-nan-a8ee0532e2eb

4. https://www.quora.com/What-is-the-weight-of-1-litre-petrol

5. https://www.aqua-calc.com/page/density-table/substance/petrol

6. https://www.inchcalculator.com/convert/kilogram-to-liter/

See you soon.