Run through of EDA and Feature engineering

Aug 17, 2021

Exploratory data analysis (EDA) is one of the most important steps in any data science project, because until we understand the data we will not be able to do any meaningful work with it.

Data is not meant to fit the model; the model has to fit the data.

During exploratory data analysis we come across many facts about the data that help us build models. These include:

Outliers
Correlations
Patterns
Missing Data
Ways to feature engineer

etc.

In this article we will do a run-through of an EDA project based on a used cars dataset.

We will do the following:

Import required libraries
Convert data types and extract features
Treat Missing values
Identify and treat outliers
Decrease number of unique values (cluster)
Encode categorical data
Compare the target variable before and after imputation
Bivariate analysis

Importing Libraries

For data wrangling:

import pandas as pd
import numpy as np

For a visual representation of missing values:

import missingno

For visualizations

import seaborn as sns
import matplotlib.pyplot as plt

For treating outliers

from scipy.stats.mstats import winsorize

Convert data types and extract features

Before converting anything, let us first take a look at the dataset.

Features like ‘Mileage’, ‘Engine’ and ‘Power’ are stored as strings, but we only need the numerical value and not the unit. Another important thing to note is that the units are not consistent: some mileage values are in km/kg and some in kmpl. We will extract the numerical values from these features too.
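To see how the units are distributed before converting anything, one quick check is to split off the last token of each string and count it. This is only a sketch on a made-up three-row frame: the sample values are assumptions, and only the column name Mileage comes from the dataset.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are made up)
sample = pd.DataFrame({"Mileage": ["26.6 km/kg", "19.67 kmpl", "18.2 kmpl"]})

# The unit is the last whitespace-separated token of each string
unit_counts = sample["Mileage"].str.split().str[-1].value_counts()
print(unit_counts)
```

Running the same line on the real column would show at a glance whether any unit other than km/kg and kmpl is present.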

First, let us extract Brand and Model from the feature Name.

#getting brand and model of car from name
data['Brand'] = data['Name'].map(lambda x : x.split(' ')[0])
data['Model'] = data['Name'].map(lambda x : x.split(' ')[1])

What exactly is happening? Let us go step by step:

The map() function executes a specified function for each item in an iterable. Here, every value in Name is passed through the lambda function.
The lambda function splits the string on “ ” (a space). The first element is stored in Brand and the second element is stored in Model.
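The same split can also be written with the pandas .str accessor. The sketch below uses two made-up car names to show what ends up in each column. One difference worth knowing: x.split(' ')[1] raises an IndexError for a one-word name, while .str[1] returns NaN instead.

```python
import pandas as pd

# Toy names (made up) to show what splitting on a space returns
sample = pd.DataFrame({"Name": ["Maruti Wagon R LXI CNG", "Hyundai Creta 1.6 CRDi"]})

parts = sample["Name"].str.split(" ")
sample["Brand"] = parts.str[0]   # first token
sample["Model"] = parts.str[1]   # second token, NaN if the name has only one word

print(sample[["Brand", "Model"]])
```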

Next we extract the numerical value from Engine and Power.

#Converting to string and extracting first value, then converting to float
data['Engine(cc)'] = data['Engine'].astype(str).map(lambda x : x.split(' ')[0]).replace('nan' , np.nan).astype(float)
data['Power(bph)'] = data['Power'].astype(str).map(lambda x : x.split(' ')[0]).replace('null' , np.nan).astype(float)

The code does the following:

The Engine column is converted to string
The lambda function splits the string and returns the first element
After mapping the values, it replaces the string 'nan' with np.nan
The values are converted to float

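Legit question: why are we converting the 'nan' strings to np.nan? It's because the string 'nan' is just text, while np.nan is a float value that pandas and NumPy recognise as missing. That keeps the column numeric, so vectorized operations and functions like isnull() work on it.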
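A small sketch makes the difference visible. The values are made up, and pd.to_numeric with errors="coerce" is used here as an alternative to the replace-then-cast approach, not as the method used in this project.

```python
import numpy as np
import pandas as pd

# Made-up values: the text 'nan' and 'null' look like missing data but are ordinary strings
raw = pd.Series(["1582", "nan", "null"])
text_missing = int(raw.isnull().sum())      # pandas does not treat these strings as missing

# errors="coerce" converts every non-numeric string to np.nan
numeric = pd.to_numeric(raw, errors="coerce")
real_missing = int(numeric.isnull().sum())

print(text_missing, real_missing)
print(numeric.mean())                       # NaN entries are skipped by default
```

Once the placeholders are real np.nan values, aggregations such as mean() and median() ignore them without any extra handling.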

Now, we will make the Mileage consistent:

#since mileage has two units, to keep it consistent, we will convert the km/kg to km/l
def mileage_convert(x):
    if type(x) == str:
        if x.split()[-1] == 'km/kg':
            return float(x.split()[0])*1.40 #formula
        elif x.split()[-1] == 'kmpl':
            return float(x.split()[0])
    else:
        return x

data['Mileage(km/l)'] = data['Mileage'].apply(mileage_convert)

Formula

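The formula first requires the weight of one litre of petrol (borrowed from Quora):

Density of petrol: 0.710 to 0.775 kg/litre at 15 deg C
Mass = Density x Volume
Mass of one litre = 0.710 x 1 = 0.710 kg (710 g)

Next we change the units, km/kg to km/l:

1 kg of petrol = 1 / 0.710 l = 1.4084507... l
Factor used in the code: 1 / 0.710 = 1.4084507..., rounded to 1.40

Strictly, 1.408 is litres per kilogram. A dimensionally consistent conversion would multiply by the density instead (km/l = km/kg x 0.710), so this factor is worth re-checking before relying on the converted values.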

Hence the steps are as follows:

Check whether the value is a string (missing values are returned unchanged)
Extract the value (1st element) and the unit (2nd element), and check whether the unit is kmpl or km/kg
If the value is in km/kg, multiply it by 1.4 to change it to km/l
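The same steps can be sketched without a row-by-row function. The rows below are made up, the constant name KMKG_TO_KMPL is hypothetical, and the factor is simply the 1.40 used in this article. Unlike mileage_convert, this version passes any unit other than km/kg through unchanged instead of returning None.

```python
import numpy as np
import pandas as pd

KMKG_TO_KMPL = 1.40  # factor taken from this article, not independently verified

# Toy rows (made up), including one missing value
sample = pd.DataFrame({"Mileage": ["26.6 km/kg", "19.67 kmpl", np.nan]})

parts = sample["Mileage"].str.split(expand=True)   # column 0 = value, column 1 = unit
value = pd.to_numeric(parts[0], errors="coerce")
unit = parts[1]

# Multiply only the km/kg rows; all other rows keep their parsed value
sample["Mileage(km/l)"] = value.where(unit != "km/kg", value * KMKG_TO_KMPL)
print(sample)
```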

Treating Missing Values

We first visualize the missing values to get a better view:

We can see that the majority of the missing values are in New_Price. To treat them, we can do the following:

Check the percentage of missing values in each column
If the percentage reaches a certain threshold, in our case 75, we drop the column
For the other columns, we impute the missing values (median for numerical data and most frequent value for categorical data)
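The imputation rule in the last step can be sketched as follows. This is a minimal example on a made-up frame: the Seats column and all values are assumptions for illustration, and only Fuel_Type is a column name that appears in this article.

```python
import numpy as np
import pandas as pd

# Toy frame (made up) with one numerical and one categorical column
sample = pd.DataFrame({
    "Seats": [5.0, np.nan, 7.0, 5.0],
    "Fuel_Type": ["Petrol", "Diesel", None, "Petrol"],
})

for col in sample.columns:
    if pd.api.types.is_numeric_dtype(sample[col]):
        # numerical column: fill with the median
        sample[col] = sample[col].fillna(sample[col].median())
    else:
        # mode() returns a Series because ties are possible; take the first entry
        sample[col] = sample[col].fillna(sample[col].mode().iloc[0])

print(sample)
```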

#see which columns have null data
d = data.isnull().any()

#calculating percentage of missing data and storing in a dictionary
k = {}
for has_null, col in zip(d, data):
    if has_null:
        k[col] = data[col].isnull().sum() / (data.shape[0] / 100)
        print("{} - {} % ".format(col, k[col]))

#removing columns with 75% or more missing data
for col in k:
    if k[col] >= 75:
        data = data.drop([col], axis=1)

#filling the remaining numerical columns with the median
for col in data:
    if data[col].dtype != object:
        data[col] = data[col].fillna(int(data[col].median()))

After these steps, let's see the missing value chart:

Treating outliers

To understand the outliers, we will plot boxplots of the data, and treat them accordingly. We should never remove outliers without analysing them.
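A boxplot draws its whiskers at 1.5 times the interquartile range by default, so the points it flags can also be listed numerically. The sketch below applies that rule to made-up kilometre readings; it is meant as a way to inspect candidates, not as a rule for deleting them.

```python
import pandas as pd

# Made-up readings with one obviously extreme value
km = pd.Series([20000, 35000, 41000, 52000, 60000, 75000, 6500000])

q1, q3 = km.quantile(0.25), km.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the whiskers are the ones a boxplot would draw as outliers
outliers = km[(km < lower) | (km > upper)]
print(outliers)
```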

Outlier: What to do?

We all know what outliers are “values different from others” or “IMPOSSIBLE values”. But how do we deal with them?

ipshita.medium.com

Let us go column by column:

Year

If we look at the minimum value of Year, it's 1998, which is not “impossible”, so we won't make any changes to this column.

Kilometers Driven

We can see that we do have outliers here: the maximum value is way above the kilometers a normal car would cover.

Just how many maximum kilometers can a car do?

Cars, specifically modern cars that we see running around today, are way more reliable than the ones that were being…

philkotse.com

Therefore, we need to treat it. I have used Winsorization to do so; the article below explains the details.

Detecting and Treating Outliers In Python — Part 3 | by Alicia Horsch | Towards Data Science

data['Kilometers_Driven'] = winsorize(data['Kilometers_Driven'].values, limits = [0, 0.01]).data

With limits = [0, 0.01] we leave the left tail untouched and replace the largest 1% of values on the right tail with the largest value below them.
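To see what winsorize does, here is a small sketch on ten made-up readings. It uses a 10% upper limit instead of 1% so that exactly one value gets capped and the effect is easy to see.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Ten made-up readings; with limits=[0, 0.1] only the single largest value is capped
km = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 5000])

# .data turns the masked array returned by winsorize back into a plain ndarray
capped = winsorize(km, limits=[0, 0.1]).data
print(capped)
```

The extreme reading is pulled down to the next largest value, while the number of rows stays the same. That is the main difference from simply dropping outliers.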

To understand the full working, please see this notebook.

In the same way, we will check and treat other columns.

Decreasing number of unique values

There are two columns in which we can decrease the number of unique values:

Name
Location

From Name, we can cluster the cars by body type:

Hatchback
Sedan
SUV
MUV
Crossover
Coupe
Convertible

And for Location, we can cluster the cities into regions.

def extract_region(x):
    if x in ('Delhi', 'Jaipur'):
        return 'North'
    elif x in ('Bangalore', 'Chennai', 'Coimbatore', 'Hyderabad', 'Kochi'):
        return 'South'
    elif x == 'Kolkata':
        return 'East'
    elif x in ('Mumbai', 'Pune', 'Ahmedabad'):
        return 'West'

data['Region'] = data['Location'].apply(extract_region)
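The same grouping can also be written as a lookup table. In the sketch below the dictionary name REGION_BY_CITY is hypothetical, and 'Lucknow' is a made-up extra city added only to show what happens to a location that is missing from the table.

```python
import pandas as pd

# City-to-region lookup mirroring the branches of extract_region
REGION_BY_CITY = {
    "Delhi": "North", "Jaipur": "North",
    "Bangalore": "South", "Chennai": "South", "Coimbatore": "South",
    "Hyderabad": "South", "Kochi": "South",
    "Kolkata": "East",
    "Mumbai": "West", "Pune": "West", "Ahmedabad": "West",
}

sample = pd.DataFrame({"Location": ["Mumbai", "Kochi", "Delhi", "Kolkata", "Lucknow"]})
sample["Region"] = sample["Location"].map(REGION_BY_CITY)

# Cities missing from the lookup become NaN, which makes them easy to spot
unmapped = sample.loc[sample["Region"].isnull(), "Location"].unique().tolist()
print(sample)
print(unmapped)
```

Checking unmapped after the mapping is a cheap way to confirm that every city in the data has been assigned a region.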

Encode Categorical Data

To encode the categorical data, there are two ways:

Using labels
One-hot encoding

To attach labels to each type, we can do the following:

#first we will see the unique values in each of the variables
data.Fuel_Type.unique()

The output:

And then, we use apply with a mapping function to assign the labels.

#encoding
def encode_fuel(x):
    if x == 'CNG':
        return 1
    elif x == 'Diesel':
        return 2
    elif x == 'Petrol':
        return 3
    elif x == 'LPG':
        return 4
    else:
        return 5

data['encoded_Fuel_Type'] = data['Fuel_Type'].apply(encode_fuel)
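A dictionary gives the same labels with less branching. In this sketch the names FUEL_CODES and OTHER_CODE are hypothetical and the sample rows are made up. Anything not in the dictionary falls back to 5, like the else branch above.

```python
import pandas as pd

FUEL_CODES = {"CNG": 1, "Diesel": 2, "Petrol": 3, "LPG": 4}
OTHER_CODE = 5  # same fallback as the else branch in encode_fuel

# Toy rows (made up); 'Electric' is not in the dictionary
sample = pd.DataFrame({"Fuel_Type": ["Diesel", "Petrol", "Electric", "CNG"]})

# map() leaves unknown categories as NaN, so fill those with the fallback code
sample["encoded_Fuel_Type"] = (
    sample["Fuel_Type"].map(FUEL_CODES).fillna(OTHER_CODE).astype(int)
)
print(sample)
```

Keep in mind that integer labels suggest an order (CNG < Diesel < Petrol) that does not really exist, which is one reason to consider one-hot encoding for this kind of column.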

To get a one-hot encoding, we use pd.get_dummies:

pd.get_dummies(data['Fuel_Type'])

In the output, a “1” marks the category that the row has, and all the other category columns are “0”.
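To actually use the encoded columns, they need to be attached back to the frame. A minimal sketch on made-up rows follows; the prefix "Fuel" is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy rows (made up)
sample = pd.DataFrame({"Fuel_Type": ["Diesel", "Petrol", "CNG"]})

# dtype=int gives 0/1 columns; newer pandas versions default to True/False
dummies = pd.get_dummies(sample["Fuel_Type"], prefix="Fuel", dtype=int)

# Attach the new columns next to the original one
encoded = pd.concat([sample, dummies], axis=1)
print(encoded)
```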

Compare the target variable before and after imputation

The target variable, “New_Price”, had more than 75% of its data missing, hence we dropped it. But for training a model we will need that column. To impute a column, it's better to use the median because it is not affected by extreme values.
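A quick sketch with made-up prices shows why the median is the safer choice. Adding a single extreme value moves the mean a long way, while the median barely changes.

```python
import pandas as pd

# Made-up prices, then the same prices plus one extreme value
prices = pd.Series([4.5, 5.0, 5.5, 6.0, 6.5])
with_extreme = pd.concat([prices, pd.Series([230.0])], ignore_index=True)

print(prices.mean(), prices.median())
print(with_extreme.mean(), with_extreme.median())
```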

Let us compare the histograms before and after:

Before imputation, we can see that the plot is positively skewed, with quite a number of value bins:

After imputation, the overall shape of the distribution remains the same, but the number of bins has changed considerably. The values that were at 200+ also seem to have diminished.