Nowadays there are plenty of courses and tutorials that focus on model building, and very few that teach about data. We do not collect data to fit a model; instead, the model should fit the data. In this “Love Data” series we are going to look at some underlying aspects of understanding data, and of deciding which model our data needs.
Regression is a parametric method, which means that we make certain assumptions about the functional form (the shape) of f, where f is the underlying true function that we want to learn.
Linear regression rests on the following assumptions:
- Linearity in parameters
- No Autocorrelation
- No Multicollinearity
- Homoscedasticity
- Errors are normally distributed
Let us see what these are in greater detail:
A regression model should be linear in its parameters and have an additive error term. Linear in parameters means that the population regression function can be written as:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$$
where Y is the dependent variable, the X's are the independent variables, β_0 is the intercept, the remaining β's are the partial slope coefficients (parameters of interest), and ε is the random error term.
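A function need not be linear in X to satisfy the linearity assumption; it only has to be linear in the parameters. Models with transformed regressors, such as squared or logarithmic terms, still qualify.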
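To make this concrete, here is a minimal sketch using made-up, noise-free data. The relationship y = 2 + 3x² is curved in x, yet after the substitution z = x² it is an ordinary straight-line regression in the parameters. The helper name `fit_simple_ols` and the data are assumptions chosen for illustration.

```python
def fit_simple_ols(z, y):
    """Closed-form OLS for the one-regressor model y = b0 + b1 * z."""
    n = len(z)
    z_mean = sum(z) / n
    y_mean = sum(y) / n
    sxy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, y))
    sxx = sum((zi - z_mean) ** 2 for zi in z)
    b1 = sxy / sxx
    b0 = y_mean - b1 * z_mean
    return b0, b1

# Hypothetical data generated from y = 2 + 3 * x**2 (no noise).
x = [-3, -2, -1, 0, 1, 2, 3]
y = [2 + 3 * xi ** 2 for xi in x]

# Transform the regressor, not the parameters: z = x**2.
z = [xi ** 2 for xi in x]
b0, b1 = fit_simple_ols(z, y)
print(b0, b1)
```

The fit recovers the intercept and slope because the model is linear in b0 and b1, even though a plot of y against x is a parabola.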
Because the observations are assumed to be randomly sampled, the error terms should be independent of one another. If the errors are related to each other, we call it autocorrelation or serial correlation.
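In mathematical terms, the assumption of no autocorrelation is:
$$\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0 \quad \text{for all } i \neq j$$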
Autocorrelation is quite common in data involving time series, because if something occurs today, its influence isn’t likely to be completely absorbed today.
A plot of the residuals against the order of the observations helps here. If the plot shows no pattern, we can say that the errors are independent. If it shows a clear rising trend, the errors are correlated, which indicates the presence of autocorrelation.
Besides graphical methods, we can also use tests such as the Durbin-Watson statistic to check whether the data has autocorrelation.
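As a rough sketch, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. The two residual series below are made up for illustration.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residuals, in observation order.
trending = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]     # neighbours alike -> positive autocorrelation
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0]  # neighbours opposite -> negative autocorrelation

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)
print(dw_trending, dw_alternating)
```

In practice a statistics library would normally be used for this; the hand-written version only shows what the statistic measures.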
If two or more independent variables are highly correlated, so that a change in one goes hand in hand with a change in another, we say that there is multicollinearity in our model. Multicollinearity is of two types: perfect multicollinearity and high multicollinearity. In practice, perfect collinearity is uncommon and can be avoided with careful attention to the model’s independent variables. High multicollinearity, on the other hand, is quite common and can create severe problems, because the coefficient estimates become unstable and their standard errors inflate.
A few reasons for high multicollinearity are the following:
- Use of variables that are lagged values of one another.
- Use of variables that share a common time trend component.
- Use of variables that capture similar phenomena.
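The first two reasons can be sketched with a small, made-up trending series: a variable and its own lagged value end up almost perfectly correlated. The variance inflation factor (VIF) computed at the end is not discussed above; it is included as one common way to quantify the problem, and the formula 1 / (1 - r²) applies to the two-regressor case only.

```python
def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    a_mean = sum(a) / n
    b_mean = sum(b) / n
    cov = sum((ai - a_mean) * (bi - b_mean) for ai, bi in zip(a, b))
    var_a = sum((ai - a_mean) ** 2 for ai in a)
    var_b = sum((bi - b_mean) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5

# Hypothetical series with an upward time trend.
series = [10.0, 10.8, 12.1, 12.9, 14.2, 15.0, 16.3, 16.9, 18.1, 19.0]

x_current = series[1:]   # value at time t
x_lagged = series[:-1]   # value at time t - 1

r = pearson_r(x_current, x_lagged)
vif = 1.0 / (1.0 - r ** 2)  # two-regressor case only
print(r, vif)
```

Using both `x_current` and `x_lagged` as regressors in the same model would give the regression very little independent information to separate their effects.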
Homoscedasticity is a crucial assumption in classical linear regression: the variance of the error term is constant across the values of the independent variables.
If the error terms are homoscedastic, then the dispersion of the errors remains the same regardless of the values of the independent variables.
When the assumption of homoscedasticity is violated, that is, when heteroskedasticity is present, the OLS estimators are no longer BLUE (Best Linear Unbiased Estimators). Specifically, under heteroskedasticity the estimators are no longer efficient.
In addition, the estimated standard errors of the coefficients will be biased, which results in unreliable hypothesis tests (t-statistics). However, the OLS estimators will remain unbiased.
To detect heteroskedasticity, the following can be used:
- Graphical examination of the residuals
- Breusch-Pagan Test
- White’s Test
- Goldfeld-Quandt test
- Park Test
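As a hedged illustration of one of these tests, the sketch below computes the studentized (Koenker) variant of the Breusch-Pagan statistic for a single regressor: fit OLS, regress the squared residuals on x, and take n times the R² of that auxiliary regression. Under homoscedasticity the statistic is approximately chi-square with one degree of freedom, whose 5% critical value is about 3.841. The data and helper names are made up for illustration.

```python
def ols_fit(x, y):
    """Closed-form OLS for y = b0 + b1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    b1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
        (xi - x_mean) ** 2 for xi in x
    )
    return y_mean - b1 * x_mean, b1

def ols_residuals(x, y):
    b0, b1 = ols_fit(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def r_squared(x, y):
    y_mean = sum(y) / len(y)
    ss_res = sum(e ** 2 for e in ols_residuals(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def breusch_pagan_lm(x, y):
    """Koenker's variant: n * R^2 from regressing squared residuals on x."""
    squared_resid = [e ** 2 for e in ols_residuals(x, y)]
    return len(x) * r_squared(x, squared_resid)

# Hypothetical data: same trend, two different error patterns.
x = [float(i) for i in range(1, 21)]
sign = [(-1) ** i for i in range(1, 21)]
y_hetero = [1 + 2 * xi + s * 0.3 * xi for xi, s in zip(x, sign)]  # spread grows with x
y_homo = [1 + 2 * xi + s * 3.0 for xi, s in zip(x, sign)]         # constant spread

CHI2_CRIT_DF1_5PCT = 3.841
lm_hetero = breusch_pagan_lm(x, y_hetero)
lm_homo = breusch_pagan_lm(x, y_homo)
print(lm_hetero > CHI2_CRIT_DF1_5PCT, lm_homo > CHI2_CRIT_DF1_5PCT)
```

The series whose error spread grows with x produces a statistic above the critical value, while the constant-spread series does not.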
Errors are normally distributed
The normality assumption states that, for any given X value, the error term follows a normal distribution with zero mean and constant variance.
In mathematical form, it can be expressed as follows:
$$\varepsilon \mid X \sim N(0, \sigma^2)$$
Graphically, this means that at every value of X the observations scatter around the regression line following the same bell-shaped curve.
The error term contains many factors (random variables) that influence the dependent variable Y and are not captured by the independent variable X. The central limit theorem indicates that the sum or mean of random variables is approximately normally distributed, as long as many independent random variables are present and the influence of any one of them is small.
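A small simulation can illustrate this argument. In the sketch below, each error is built as the sum of 50 small, independent, uniformly distributed factors, which are clearly not normal on their own. The number of factors, their range, and the sample size are arbitrary assumptions. For a normal distribution both skewness and excess kurtosis are zero, so values close to zero are consistent with normality.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def simulated_error(n_factors=50):
    """One error term: the sum of many small, independent, non-normal factors."""
    return sum(random.uniform(-0.1, 0.1) for _ in range(n_factors))

def skewness_and_excess_kurtosis(values):
    """Moment-based sample skewness and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# A single uniform factor is far from normal (flat, light-tailed).
single_factor = [random.uniform(-0.1, 0.1) for _ in range(5000)]
_, kurt_single = skewness_and_excess_kurtosis(single_factor)

# The sum of many such factors is close to normal.
errors = [simulated_error() for _ in range(5000)]
skew_sum, kurt_sum = skewness_and_excess_kurtosis(errors)

print(kurt_single, skew_sum, kurt_sum)
```

The single factor shows strongly negative excess kurtosis, while the summed errors sit near zero on both measures, which is what the central limit theorem leads us to expect.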
Though linear regression is a very simple and powerful technique, there are many factors that we need to keep in mind when predicting with it. We should always select a model that fits our data, rather than trying to make the data fit the model. Looking through the data and understanding the residuals of the regression can help us get better predictions and better estimators.