A Run-Through of EDA and Feature Engineering

Ipshita
8 min read · Aug 17, 2021

Exploratory data analysis is one of the most important steps in any data science project. Until we understand the data, we will not be able to do any meaningful work with it.

Data is not meant to fit the model, the model has to fit the data

During exploratory data analysis, we come across many facts about the data that help us build models. These include:

  1. Outliers
  2. Correlations
  3. Patterns
  4. Missing Data
  5. Ways to feature engineer

etc.

In this project we will run through an EDA of a used cars dataset.

We will do the following:

  1. Import required libraries
  2. Convert data types and extract features
  3. Treat Missing values
  4. Identify and treat outliers
  5. Decrease number of unique values (cluster)
  6. Encode categorical data
  7. Compare the target variable before and after imputation
  8. Bivariate analysis

Importing Libraries

  • For data wrangling:
import pandas as pd
import numpy as np
  • For visual representation of missing numbers
import missingno
  • For visualizations
import seaborn as sns
import matplotlib.pyplot as plt
  • For treating outliers
from scipy.stats.mstats import winsorize

Convert data types and extract features

Before going to conversion, let us first get a view of the dataset

glimpse of dataset

Features like ‘Mileage’, ‘Engine’ and ‘Power’ are stored as strings, and we only need the numerical value, not the unit. Another important thing to note is that the Mileage units are not consistent: some are km/kg and some are kmpl. We will extract these features too.

First, let us extract Brand and Model from the feature Name.

#getting brand and model of car from name
data['Brand'] = data['Name'].map(lambda x : x.split(' ')[0])
data['Model'] = data['Name'].map(lambda x : x.split(' ')[1])

What exactly is happening? Let us go step by step:

  1. The map() method applies a function to every element of a Series. Here, every value of Name is passed through the lambda function.
  2. The lambda function splits the string on “ ”. The first element is stored in Brand and the second element in Model.
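To see the split on concrete values, here is a minimal sketch on a toy frame. The two car names are made-up examples, not rows quoted from the dataset, and the `.str.split` form is just an equivalent alternative, not what the notebook uses.

```python
import pandas as pd

# Toy data: the names are illustrative, not taken from the dataset
toy = pd.DataFrame({"Name": ["Maruti Wagon R LXI CNG",
                             "Hyundai Creta 1.6 CRDi SX Option"]})

# Same approach as above: split on spaces, keep the first and second tokens
toy["Brand"] = toy["Name"].map(lambda x: x.split(" ")[0])
toy["Model"] = toy["Name"].map(lambda x: x.split(" ")[1])

# Equivalent vectorized form using the .str accessor
parts = toy["Name"].str.split(" ")
toy["Brand_alt"] = parts.str[0]
toy["Model_alt"] = parts.str[1]

print(toy[["Brand", "Model"]])
```

One thing this makes visible: a two-word model name such as "Wagon R" is cut down to its first word, so Model is a coarse label rather than the exact model.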

Next we extract the numerical value from Engine and Power.

#Converting to string and extracting the first value, then converting to float
#(np.float has been removed from recent NumPy versions, so the builtin float is used)
data['Engine(cc)'] = data['Engine'].astype(str).map(lambda x : x.split(' ')[0]).replace('nan' , np.nan).astype(float)
data['Power(bph)'] = data['Power'].astype(str).map(lambda x : x.split(' ')[0]).replace('null' , np.nan).astype(float)

The code does the following:

  1. The column is converted to string
  2. The lambda function splits each string and returns the first element
  3. After mapping, the placeholder text ('nan' for Engine, 'null' for Power) is replaced with np.nan
  4. The values are converted to float

Legit question: why are we converting these placeholders to np.nan? Because 'nan' and 'null' are just text, whereas np.nan is a float that pandas recognises as missing. The column can then be stored as floats, functions like isnull() and median() treat those entries as missing, and vectorized computations stay simple and efficient.
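A small sketch of the difference, using made-up values shaped like the Power column. The `pd.to_numeric(..., errors="coerce")` line at the end is an alternative I am adding for comparison; it is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy values shaped like the Power column: number + unit, with a 'null' placeholder
power = pd.Series(["58.16 bhp", "null bhp", "126.2 bhp"])

first_token = power.map(lambda x: x.split(" ")[0])
print(first_token.isnull().sum())   # the text 'null' is not counted as missing

as_float = first_token.replace("null", np.nan).astype(float)
print(as_float.isnull().sum())      # now the placeholder is counted as missing
print(as_float.median())            # NaN is skipped when computing the median

# Alternative: pd.to_numeric turns anything unparseable into NaN in one step
alt = pd.to_numeric(first_token, errors="coerce")
```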

Now, we will make the Mileage consistent:

#since mileage has two units, to keep it consistent, we will convert the km/kg to km/l
def mileage_convert(x):
    if type(x) == str:
        if x.split()[-1] == 'km/kg':
            return float(x.split()[0])*1.40 #formula
        elif x.split()[-1] == 'kmpl':
            return float(x.split()[0])
    else:
        return x

data['Mileage(km/l)'] = data['Mileage'].apply(mileage_convert)

Formula

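The factor comes from the weight of one litre of petrol (borrowed from Quora).

The density of petrol ranges from 0.710 to 0.775 kg/litre at 15 °C.
Formula for conversion: Mass = Density × Volume
Mass of one litre = 0.710 × 1 = 0.710 kg, or 710 g

Next we change the units, km/kg to km/l:

1 / 0.710 = 1.4084507..., which is rounded to the factor 1.40

One caveat: by units, km/l = km/kg × kg/l, so a strict conversion would multiply by the density (0.710) instead of dividing by it. Treat the converted values as a rough approximation.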

Hence the steps are as follows:

  1. Check whether the value is a string
  2. Extract the value (1st element) and the unit (last element) to see whether the unit is kmpl or km/kg
  3. If the unit is km/kg, we multiply the value by 1.4 to change it to km/l
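The steps above can be tried on a few toy values. This is a sketch under assumptions: the `factor` parameter and the NaN returned for an unknown unit are my additions for illustration, and the default of 1.40 is simply the factor used in this article.

```python
import numpy as np
import pandas as pd

KM_PER_KG_TO_KMPL = 1.40  # factor used in this article (an approximation)

def mileage_convert(x, factor=KM_PER_KG_TO_KMPL):
    """Return mileage as a float in km/l; non-strings (e.g. NaN) pass through."""
    if isinstance(x, str):
        value, unit = x.split()[0], x.split()[-1]
        if unit == "km/kg":
            return float(value) * factor
        if unit == "kmpl":
            return float(value)
        return np.nan  # unknown unit: flagged as missing (assumption)
    return x

# Made-up values covering both units and a missing entry
toy = pd.Series(["26.6 km/kg", "19.67 kmpl", np.nan])
converted = toy.apply(mileage_convert)
print(converted)
```

Keeping the factor as a parameter makes it easy to rerun the conversion with a different value if the approximation is revisited.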

Treating Missing Values

We first visualize the missing values to get a better view:

missing value

We can see that the majority of the missing values are in New_Price. To treat them, we can do the following:

  1. Check the percentage of missing values in each column
  2. If the percentage reaches a certain threshold (75% in our case), we drop the column
  3. For other columns, we impute them (median for numerical data and most frequent for categorical data)

#see null data
d = data.isnull().any()

#calculating percentage of missing data and storing in a dictionary
k = {}
for i,j in zip(d,data):
    if i:
        pct = data[j].isnull().sum()/(data.shape[0]/100)
        print("{} - {} % ".format(j,pct))
        k[j] = pct

#removing columns with 75% or more missing data
for i in k:
    if k[i]>=75:
        data = data.drop([i],axis=1)

#filling the remaining numerical columns with the median
#(assigning the result back avoids the unreliable inplace fill on a column)
for i in data:
    if data[i].dtype!=object:
        data[i] = data[i].fillna(int(data[i].median()))
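The loop above only fills the numerical columns. A compact sketch of all three steps, including the most-frequent fill for categorical columns, might look like this. The toy frame and its values are made up; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame; column names mirror the dataset but the values are made up
toy = pd.DataFrame({
    "Seats": [5.0, np.nan, 7.0, 5.0],
    "Fuel_Type": ["Petrol", None, "Diesel", "Petrol"],
    "New_Price": [np.nan, np.nan, np.nan, "8.6 Lakh"],
})

# Step 1: percentage of missing values per column
missing_pct = toy.isnull().mean() * 100

# Step 2: drop columns at or above the 75% threshold
toy = toy.drop(columns=missing_pct[missing_pct >= 75].index)

# Step 3: median for numeric columns, most frequent value for the rest
for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        toy[col] = toy[col].fillna(toy[col].median())
    else:
        toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy)
```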

After these steps, let's see the missing value chart:

clean!

Treating outliers

To understand the outliers, we will plot boxplots of the data, and treat them accordingly. We should never remove outliers without analysing them.
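By default a boxplot draws its whiskers up to 1.5 times the interquartile range (IQR) beyond the quartiles and plots everything outside as individual points. The same rule can be applied numerically to list the candidate outliers before deciding what to do with them. The helper name `iqr_outliers` and the odometer readings below are made up for illustration.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]

# Made-up odometer readings with one implausible value
km = pd.Series([30000, 42000, 55000, 61000, 72000, 80000, 6500000])
print(iqr_outliers(km))
```

Listing the flagged rows first makes it easier to judge whether a value is an error or just a rare but valid case.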

Let us go column by column:

Year

boxplot for year

The minimum value of Year is 1998, which is not “impossible”, so we won't make any changes to this column.

Kilometers Driven

box plot for km driven

We can see that we do have outliers here: the maximum value is far above the kilometers a normal car would cover.

Therefore, we need to treat it. I have used Winsorization to do so; the article below explains the details.

Detecting and Treating Outliers In Python — Part 3 | by Alicia Horsch | Towards Data Science

data['Kilometers_Driven'] = winsorize(data['Kilometers_Driven'].values, limits = [0, 0.01]).data

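With limits = [0, 0.01], the left tail is left untouched and the largest 1% of values are replaced by the largest remaining value (roughly the 99th percentile).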
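A tiny sketch shows what winsorize does to the right tail. The ten readings are made up, and the limit is 0.1 rather than 0.01 so that exactly one of the ten values is capped.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Ten made-up readings; limits=[0, 0.1] caps the top 10%, i.e. one value
km = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 6500])
capped = winsorize(km, limits=[0, 0.1]).data

print(capped.max())  # the extreme value is replaced by the next largest one
print(capped.min())  # the lower tail is untouched
```

Unlike dropping rows, this keeps every observation and only pulls the extreme values in.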

To understand the full working, please see this notebook.

treated boxplot

In the same way, we will check and treat other columns.

Decreasing number of unique values

There are two columns in which we can decrease the number of unique values:

  1. Name
  2. Location

From Name, we can cluster the cars by type:

  • Hatchback
  • Sedan
  • SUV
  • MUV
  • Crossover
  • Coupe
  • Convertible

And for Location, we can cluster them as regions.

def extract_region(x):
    if x in ('Delhi', 'Jaipur'):
        return 'North'
    elif x in ('Bangalore', 'Chennai', 'Coimbatore', 'Hyderabad', 'Kochi'):
        return 'South'
    elif x == 'Kolkata':
        return 'East'
    elif x in ('Mumbai', 'Pune', 'Ahmedabad'):
        return 'West'

data['Region'] = data['Location'].apply(extract_region)
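The same clustering can also be written as a lookup table with map(), which is an alternative sketch and not the notebook's code. The city list is copied from extract_region above; 'Lucknow' is a made-up city added only to show what happens to a location that is not in the table.

```python
import pandas as pd

# City-to-region lookup copied from extract_region above
REGION_BY_CITY = {
    "Delhi": "North", "Jaipur": "North",
    "Bangalore": "South", "Chennai": "South", "Coimbatore": "South",
    "Hyderabad": "South", "Kochi": "South",
    "Kolkata": "East",
    "Mumbai": "West", "Pune": "West", "Ahmedabad": "West",
}

# 'Lucknow' is a hypothetical unmapped city
toy = pd.DataFrame({"Location": ["Pune", "Kochi", "Delhi", "Lucknow"]})
toy["Region"] = toy["Location"].map(REGION_BY_CITY)

# Cities missing from the lookup come out as NaN, which makes gaps easy to spot
print(toy[toy["Region"].isnull()])
```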

Encode Categorical Data

To encode the categorical data, there are two ways:

  1. Using labels
  2. One-hot encoding

To attach labels to each type, we can do the following:

#first we will see the unique values in each of the variables
data.Fuel_Type.unique()

The output:

Then we use apply with a mapping function to assign the labels.

#encoding
def encode_fuel(x):
    if x == 'CNG':
        return 1
    elif x == 'Diesel':
        return 2
    elif x == 'Petrol':
        return 3
    elif x == 'LPG':
        return 4
    else:
        return 5

data['encoded_Fuel_Type'] = data['Fuel_Type'].apply(encode_fuel)
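As a shorter variant, the same labels can come from a dictionary, with a fallback code for anything not listed. This is a sketch on made-up rows; `FUEL_CODES` and `OTHER_CODE` are names introduced here for illustration.

```python
import pandas as pd

FUEL_CODES = {"CNG": 1, "Diesel": 2, "Petrol": 3, "LPG": 4}
OTHER_CODE = 5  # same fallback as the else branch above

toy = pd.DataFrame({"Fuel_Type": ["Petrol", "Diesel", "Electric", "CNG"]})

# map() leaves unknown fuel types as NaN, which fillna turns into the fallback code
toy["encoded_Fuel_Type"] = (
    toy["Fuel_Type"].map(FUEL_CODES).fillna(OTHER_CODE).astype(int)
)
print(toy)
```

Keep in mind that integer labels suggest an ordering (CNG < Diesel < Petrol) that fuel types do not really have, which is one reason to consider one-hot encoding instead.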

To get one hot encoding, we use pd.get_dummies :

pd.get_dummies(data['Fuel_Type'])
after one hot encoding

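In each row, the column that matches the row's fuel type holds “1” and all the other columns hold “0”. (Recent pandas versions show True/False instead unless dtype=int is passed.)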
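To use the dummy columns in a model they usually have to be joined back to the frame. A minimal sketch on made-up rows, where the `prefix` value "Fuel" is my choice for readable column names:

```python
import pandas as pd

toy = pd.DataFrame({"Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"]})

# dtype=int forces 0/1 output; prefix keeps the new column names readable
dummies = pd.get_dummies(toy["Fuel_Type"], prefix="Fuel", dtype=int)

# Attach the indicator columns next to the original column
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```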

Compare the target variable before and after imputation

The target variable, “New_Price”, had more than 75% of its data missing, hence we dropped it. But for training a model, we will need that column. To impute a column like this, it is better to use the median than the mean, because the median is not affected by extreme values.
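A quick illustration of why the median is the safer choice, using made-up prices with a single extreme value and two gaps:

```python
import numpy as np
import pandas as pd

# Made-up prices (in lakh) with one extreme value and two missing entries
price = pd.Series([5.5, 6.0, 7.2, 8.1, 230.0, np.nan, np.nan])

print(price.mean())    # pulled far upward by the single 230.0
print(price.median())  # stays near the bulk of the data

# Fill the gaps with the median
filled = price.fillna(price.median())
print(filled)
```

Note that every filled row receives the same value, so when a large share of a column is imputed the histogram gains a tall spike at the median.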

Let us compare the histograms before and after:

Before imputation, we can see that the plot is positively skewed, with quite a number of value bins:

Before imputation

After imputation, the overall shape of the distribution remains the same, but the number of bins has changed considerably. The values at 200+ also seem to have diminished.

Bivariate Analysis

This is the most interesting part of EDA: this is where we look at the graphs and draw inferences from them. Let's see a few:

Location and Price:

  • Coimbatore and Bangalore are ahead, which will affect the region-wise price too.
  • Kolkata, Kochi and Pune are almost on the same line.

What can be the reason? All these cities are metros, so why is there such unevenness in price?
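A comparison like this can be backed with numbers by grouping on Location. The sketch below uses made-up prices, so the figures are not the dataset's; it reports the median together with the row count, since a city with few listings can look expensive by chance.

```python
import pandas as pd

toy = pd.DataFrame({
    "Location": ["Coimbatore", "Coimbatore", "Kolkata", "Kolkata", "Pune", "Pune"],
    "Price": [12.0, 9.0, 4.5, 5.5, 5.0, 6.0],  # made-up values
})

# Median price and number of listings per city, most expensive first
summary = (
    toy.groupby("Location")["Price"]
       .agg(["median", "count"])
       .sort_values("median", ascending=False)
)
print(summary)
```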

KM vs Price

  • Faint correlation
  • The price of less driven cars is higher (Duh!)
  • One of the most driven cars is highly priced. Why? Maybe because it is a high-end car.
  • Correlation can be seen
  • Why are there such crashes and peaks? Is it because of the models?
  • Do model and company matter more than anything?
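One way to probe the questions above is to compute the correlation with and without the unusual car. The rows below are made up to mimic the pattern described (price falling with kilometers, plus one heavily driven but expensive car), so the numbers only illustrate the effect.

```python
import pandas as pd

# Made-up rows; column names mirror the dataset
toy = pd.DataFrame({
    "Kilometers_Driven": [15000, 30000, 45000, 60000, 90000, 120000],
    "Price": [9.5, 8.0, 7.5, 5.0, 4.2, 12.0],  # last row: heavily driven but expensive
})

# Pearson correlation on all rows
r_all = toy["Kilometers_Driven"].corr(toy["Price"])

# Same correlation after leaving out the heavily driven, expensive car
rest = toy.iloc[:-1]
r_without = rest["Kilometers_Driven"].corr(rest["Price"])

print(round(r_all, 2), round(r_without, 2))
```

A single high-end car is enough to turn a strong negative relationship into a faint one, which supports looking at price within each brand or model.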

Model and Price

  • Lamborghini is the highest (way too high), but other high-end companies like Audi and Porsche are comparatively lower.

Isn’t it interesting!

Notebook — eda and visualizations | Kaggle

If you have been with me so far, please give an upvote :)

See you soon.

Photo by Kelly Sikkema on Unsplash
