Plenty of courses and tutorials today focus on model building, but very few teach how to work with data. We should not shape the data to fit a model; instead, we should choose a model that fits the data. In this “Love Data” series, we look at some underlying aspects of understanding data and of deciding which model our data calls for.

# Introduction

Linear regression is a parametric method, which means that we make certain assumptions about the functional form, or shape, of *f*, where *f* is the underlying true function that we want to learn.

Linear regression relies on the following assumptions:

- Linearity
- No Autocorrelation
- No Multicollinearity
- Homoscedasticity
- Errors are normally distributed

Let us see what these are in greater detail:

# Linearity

A regression model should be linear in its parameters and have an additive error term. **Linear in parameters** means that the population regression function can be written as:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$$

where **Y** is the dependent variable, the **X**’s are the independent variables, the **β**’s are the parameters of interest (β₀ is the intercept and the others are the partial slope coefficients), and **ε** is the random error term.

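A function need not be linear in the variables to satisfy the linearity assumption; it only has to be linear in the parameters. A few examples are:

$$Y = \beta_0 + \beta_1 X^2 + \varepsilon$$

$$\ln Y = \beta_0 + \beta_1 \ln X + \varepsilon$$

$$Y = \beta_0 + \beta_1 \frac{1}{X} + \varepsilon$$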
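As a minimal sketch of this idea, the Python snippet below fits a model of the form Y = β₀ + β₁X², which is curved in X but still linear in β₀ and β₁. The data and the helper `fit_simple_ols` are hypothetical and only serve as illustration: squaring X first turns the problem into an ordinary straight-line fit.

```python
def fit_simple_ols(z, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * z."""
    n = len(z)
    z_mean = sum(z) / n
    y_mean = sum(y) / n
    sxy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, y))
    szz = sum((zi - z_mean) ** 2 for zi in z)
    slope = sxy / szz
    intercept = y_mean - slope * z_mean
    return intercept, slope


# Hypothetical noise-free data generated from Y = 2 + 3 * X^2
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 + 3.0 * xi ** 2 for xi in x]

# Transform the regressor. The model stays linear in b0 and b1,
# even though it is not linear in x.
z = [xi ** 2 for xi in x]
b0, b1 = fit_simple_ols(z, y)
print(f"intercept = {b0:.3f}, slope on X^2 = {b1:.3f}")
```

Because the data here contain no noise, the fit should recover the generating coefficients up to floating-point error.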

# No Autocorrelation

Because the observations are assumed to be randomly sampled, the error terms should be independent of one another. If the errors are related to each other, we call it *autocorrelation* or *serial correlation*.

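In mathematical terms, the no-autocorrelation assumption says that

$$\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0 \quad \text{for all } i \neq j$$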

Autocorrelation is quite common in data involving time series, because if something occurs today, its influence isn’t likely to be completely absorbed today.

One way to check this is to plot the residuals in time order. A plot with no visible pattern suggests that the errors are independent. A plot with a clear rising trend, on the other hand, indicates that the errors are correlated, and thus that autocorrelation is present.

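Apart from the graphical method, we can also use tests such as the **Durbin-Watson** statistic to check whether the residuals are autocorrelated. The statistic ranges from 0 to 4, and a value close to 2 suggests no first-order autocorrelation.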
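To make the Durbin-Watson statistic concrete, here is a small sketch that computes it by hand: the sum of squared differences between successive residuals, divided by the sum of squared residuals. The two residual series are simulated and purely hypothetical (one independent, one following an AR(1) process with coefficient 0.9), and the seed is arbitrary.

```python
import random


def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


rng = random.Random(0)  # fixed seed, chosen arbitrarily

# Hypothetical independent errors
independent = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Hypothetical autocorrelated errors: e_t = 0.9 * e_{t-1} + u_t
correlated = [rng.gauss(0.0, 1.0)]
for _ in range(499):
    correlated.append(0.9 * correlated[-1] + rng.gauss(0.0, 1.0))

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)
print(f"independent errors: DW = {dw_independent:.2f}")
print(f"AR(1) errors:       DW = {dw_correlated:.2f}")
```

The independent series should give a value near 2, while the positively autocorrelated series should give a value much closer to 0. In practice the statistic would be computed on the residuals of a fitted regression rather than on simulated errors.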

# No Multicollinearity

If two or more independent variables are strongly correlated with one another, so that a change in one tends to go along with a change in another, we say that there is **multicollinearity** in our model. Multicollinearity is of two types: *perfect multicollinearity* and *high multicollinearity*. In practice, perfect multicollinearity is uncommon and can be avoided with careful attention to the model’s independent variables. High multicollinearity, on the other hand, is quite common and can create severe estimation problems, because the coefficient estimates become unstable and their standard errors inflate.

A few common reasons for high multicollinearity are:

- Use of variables that are lagged values of one another.
- Use of variables that share a common time trend component.
- Use of variables that capture similar phenomena.
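One common way to quantify multicollinearity is the variance inflation factor (VIF), defined for each predictor as 1 / (1 − R²), where R² comes from regressing that predictor on the other predictors. The sketch below covers only the two-predictor case, where that R² is simply the squared correlation between the two variables. The data are simulated and hypothetical: a trending series and its own one-period lag, which matches the first two causes listed above.

```python
import math
import random


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two of them."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


rng = random.Random(1)  # fixed seed, chosen arbitrarily

# Hypothetical trending series and its one-period lag
series = [0.5 * t + rng.gauss(0.0, 1.0) for t in range(101)]
current = series[1:]
lagged = series[:-1]

# Hypothetical predictor unrelated to the series
unrelated = [rng.gauss(0.0, 1.0) for _ in range(100)]

vif_lagged = vif_two_predictors(current, lagged)
vif_unrelated = vif_two_predictors(current, unrelated)
print(f"series vs. its lag:        VIF = {vif_lagged:.1f}")
print(f"series vs. unrelated var.: VIF = {vif_unrelated:.2f}")
```

A VIF close to 1 indicates little collinearity. A frequently quoted rule of thumb treats values above roughly 10 as a warning sign, although that threshold is a convention rather than a formal test.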

# Homoscedasticity

Homoscedasticity is a crucial assumption in classical linear regression which means that the variance of the error term is constant over various values of the independent variables.

If the error terms are homoscedastic, then the *dispersion of the error* remains the same **regardless of the values of the independent variables.**

When the assumption of homoscedasticity is violated (that is, when the errors are heteroskedastic), the OLS estimators are no longer BLUE (Best Linear Unbiased Estimators). Specifically, under heteroskedasticity the estimators **are no longer efficient.**

In addition, the estimated standard errors of the coefficients will be biased, which results in **unreliable hypothesis tests** (t-statistics). However, the **OLS estimators will remain unbiased.**

To detect heteroskedasticity, the following methods can be used:

- Graphical examination of the residuals
- Breusch-Pagan Test
- White’s Test
- Goldfeld-Quandt Test
- Park Test
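As an illustration of one test from this list, the sketch below implements a simplified Breusch-Pagan test for a single predictor, using the n·R² form of the statistic. It fits the regression, regresses the squared residuals on X, and compares n·R² from that auxiliary regression against a chi-square distribution with one degree of freedom. The data, the seed, and the helper names are hypothetical, and the chi-square p-value shortcut through `math.erfc` is valid only for one degree of freedom.

```python
import math
import random


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def r_squared(x, y):
    """R-squared of the simple regression of y on x."""
    intercept, slope = simple_ols(x, y)
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def breusch_pagan(x, y):
    """Return (LM statistic, p-value) for a one-predictor regression."""
    intercept, slope = simple_ols(x, y)
    squared_resid = [(yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression of squared residuals on x; LM = n * R^2.
    lm = max(len(x) * r_squared(x, squared_resid), 0.0)
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value


rng = random.Random(42)  # fixed seed, chosen arbitrarily
x = [rng.uniform(1.0, 10.0) for _ in range(500)]

# Hypothetical data: constant error variance vs. variance growing with x
y_homo = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
y_hetero = [1.0 + 2.0 * xi + xi * rng.gauss(0.0, 1.0) for xi in x]

lm_homo, p_homo = breusch_pagan(x, y_homo)
lm_hetero, p_hetero = breusch_pagan(x, y_hetero)
print(f"constant variance: LM = {lm_homo:.2f}, p = {p_homo:.3f}")
print(f"growing variance:  LM = {lm_hetero:.2f}, p = {p_hetero:.3g}")
```

A small p-value is evidence against constant error variance. With several predictors, the auxiliary regression would include all of them and the degrees of freedom would equal their number, which this sketch does not handle.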

# Errors are normally distributed

The normality assumption states that, for any given value of X, the error term follows a normal distribution with zero mean and constant variance.

In mathematical form, it can be expressed as follows:

$$\varepsilon \mid X \sim N(0, \sigma^2)$$

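Graphically, this means that at every value of X the errors are spread around the regression line in a bell-shaped curve centered at zero, with the same spread everywhere.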

The error term contains many factors (random variables) that influence the dependent variable Y and are not captured by the independent variable X. **The central limit theorem indicates that the sum or mean of many independent random variables is approximately normally distributed, as long as the influence of any one random variable is small.**
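The central limit theorem argument can be illustrated with a short simulation. In the sketch below, a hypothetical error term is built either from a single uniformly distributed factor or from the sum of 50 such factors, and the sample skewness and excess kurtosis are compared (both are 0 for a normal distribution). The number of factors, the uniform distribution, and the seed are all arbitrary assumptions made for illustration.

```python
import random


def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) based on central sample moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


rng = random.Random(7)  # fixed seed, chosen arbitrarily
n_draws = 5000

# Error term driven by a single omitted factor (clearly not normal)
single_factor = [rng.uniform(-0.5, 0.5) for _ in range(n_draws)]

# Error term driven by the sum of 50 small, independent omitted factors
many_factors = [
    sum(rng.uniform(-0.5, 0.5) for _ in range(50)) for _ in range(n_draws)
]

skew_one, kurt_one = skewness_and_excess_kurtosis(single_factor)
skew_many, kurt_many = skewness_and_excess_kurtosis(many_factors)
print(f"1 factor:   skewness = {skew_one:+.2f}, excess kurtosis = {kurt_one:+.2f}")
print(f"50 factors: skewness = {skew_many:+.2f}, excess kurtosis = {kurt_many:+.2f}")
```

The single-factor error should show a strongly negative excess kurtosis (a uniform distribution has a theoretical value of −1.2), while the sum of many small factors should give values close to 0 on both measures, in line with an approximately normal error term.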

# Conclusion

Though linear regression is a simple and powerful technique, there are many factors that we need to keep in mind when predicting with it. We should always select a model that fits our data, rather than trying to make the data fit the model. Looking through the data and understanding the residuals of the regression can help us get better predictions and better estimators.