In statistics, regression analysis comprises a family of techniques for modelling and analysing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps us understand how the typical value of the dependent variable changes when one of the independent variables changes while the other independent variables are held fixed. Let us consider a simple example: suppose we are trying to predict levels of stress from the amount of time left until one has to give a talk. One would expect a negative relationship, since the less time remains until the talk, the greater the anxiety. Describing the direction and strength of that relationship is what we earlier called correlation. Now extend the question: if there are 10 minutes to go until someone has to give a talk, how anxious will they be? Answering that kind of question is the essence of regression analysis.
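The stress example can be made concrete with a small sketch. The data are simulated purely for illustration: the names `minutes_left` and `stress` and every number below are assumptions, not measurements. The correlation summarises the direction and strength of the relationship, while the fitted line gives a predicted stress level at 10 minutes.

```r
# Simulated data (an assumption for illustration only).
set.seed(42)
minutes_left <- seq(5, 120, by = 5)
stress <- 80 - 0.5 * minutes_left + rnorm(length(minutes_left), sd = 5)

# Correlation: direction and strength of the relationship.
r <- cor(minutes_left, stress)
print(r)

# Regression: a fitted line that can be used for prediction.
fit <- lm(stress ~ minutes_left)
pred_10 <- predict(fit, newdata = data.frame(minutes_left = 10))
print(pred_10)
```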
The simple linear regression model describes the relationship between the dependent variable and a single independent variable. We use a case study to explain the important characteristics of the model. The case study analyses the factors that explain customer satisfaction for a retail giant. It is essentially a ratings data set in which customers have rated various departments. Based on this data we will try to understand how the retailer can improve its customer satisfaction.
The first step is importing the data.
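A minimal sketch of the import step, assuming the ratings are stored as a CSV file. The file written here is a hypothetical stand-in so that the snippet runs on its own, and the column `Product_Quality` and all values are made up; only `Customer_Satisfaction` and `Delivery_Speed` are names taken from the case study.

```r
# Write a tiny stand-in file (hypothetical rows) to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Customer_Satisfaction,Delivery_Speed,Product_Quality",
  "8.2,3.7,8.5",
  "5.7,4.9,8.2",
  "8.9,4.5,9.2"
), csv_path)

# Import the data; in practice csv_path would point at the real ratings file.
ratings <- read.csv(csv_path, header = TRUE)
str(ratings)
```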
Next we generate the random numbers.
The next step is splitting the dataset.
R generates random numbers for the dataset, which are then used to split it into a training dataset and a validation dataset.
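One common way to do this, sketched under assumptions: draw a uniform random number for every row and split on a cutoff. The 70/30 cutoff, the seed, and the simulated `ratings` data frame are all illustrative choices, not values from the case study.

```r
# Simulated stand-in for the ratings data (an assumption).
set.seed(123)
ratings <- data.frame(
  Customer_Satisfaction = rnorm(100, mean = 7, sd = 1),
  Delivery_Speed        = rnorm(100, mean = 4, sd = 0.7)
)

# One uniform random number per row.
ratings$random <- runif(nrow(ratings))

# Rows at or below the cutoff go to training, the rest to validation.
train <- subset(ratings, random <= 0.7)
valid <- subset(ratings, random > 0.7)

nrow(train)
nrow(valid)
```

Fixing the seed with `set.seed()` makes the split reproducible from one run to the next.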
Before making any predictions with the classical linear regression model (CLRM), we must check whether the assumptions of the model are satisfied. The three most important assumptions are:
• There should not be any multicollinearity among the explanatory variables
• There should not be autocorrelation among the errors
• The variance of the error terms should be constant
In R, however, we must first write down a linear equation and fit a model from it before we can run these tests.
We first create an object that stores the independent variables in equation form, joined by plus signs, using paste().
We then complete the linear equation by setting the dependent variable, Customer_Satisfaction, against that expression, and check the class of the resulting object.
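The two steps just described can be sketched as follows. The small `train` data frame and the predictor names `Product_Quality` and `Complaint_Resolution` are hypothetical; `paste()` and `as.formula()` are base R functions.

```r
# Hypothetical training data with a few predictor columns.
train <- data.frame(
  Customer_Satisfaction = c(8.2, 5.7, 8.9, 4.8, 7.1),
  Product_Quality       = c(8.5, 8.2, 9.2, 6.4, 9.0),
  Complaint_Resolution  = c(5.9, 7.2, 5.6, 3.7, 4.6),
  Delivery_Speed        = c(3.7, 4.9, 4.5, 3.0, 3.5)
)

# Join the independent variables with plus signs.
predictors <- setdiff(names(train), "Customer_Satisfaction")
rhs <- paste(predictors, collapse = " + ")

# Put the dependent variable on the left-hand side and convert to a formula.
model_formula <- as.formula(paste("Customer_Satisfaction ~", rhs))
print(model_formula)
class(model_formula)
```

Building the formula from column names avoids typing every predictor by hand and makes it easy to drop a variable later.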
Conducting the tests
To test for multicollinearity, autocorrelation and heteroscedasticity, we must make sure that the car package, which provides these tests, is installed and loaded in R.
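A sketch of the three checks on simulated data. The data frame `sim` is an assumption in which `Delivery_Speed` is deliberately built to be nearly collinear with a hypothetical `Complaint_Resolution`. The VIF is also computed by hand so that the idea is visible without any package; the `car` calls run only if the package is installed.

```r
# Simulated data (an assumption for illustration, not the case-study file).
set.seed(1)
n <- 200
Complaint_Resolution <- rnorm(n)
sim <- data.frame(
  Complaint_Resolution = Complaint_Resolution,
  # Built to be nearly collinear with Complaint_Resolution.
  Delivery_Speed       = Complaint_Resolution + rnorm(n, sd = 0.1),
  Product_Quality      = rnorm(n)
)
sim$Customer_Satisfaction <- 7 + 0.8 * sim$Complaint_Resolution +
  0.5 * sim$Product_Quality + rnorm(n)

predictors <- c("Complaint_Resolution", "Delivery_Speed", "Product_Quality")
fit <- lm(reformulate(predictors, response = "Customer_Satisfaction"), data = sim)

# VIF by hand: regress each predictor on the others, then 1 / (1 - R^2).
vif_manual <- sapply(predictors, function(v) {
  aux <- lm(reformulate(setdiff(predictors, v), response = v), data = sim)
  1 / (1 - summary(aux)$r.squared)
})
print(vif_manual)

# The same checks via the car package, if it is installed.
if (requireNamespace("car", quietly = TRUE)) {
  print(car::vif(fit))               # multicollinearity
  print(car::durbinWatsonTest(fit))  # autocorrelation of the errors
  print(car::ncvTest(fit))           # non-constant error variance
}
```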
From the output we can infer that Delivery_Speed is causing multicollinearity in this dataset because of its high VIF value, so we drop it from the analysis. However, this technique is not recommended when more than one variable would have to be dropped, because it may remove a significant proportion of the model's explanatory capacity. The dropped explanatory variables then enter the error term, which may make the error term systematic.
Next we drop the offending variable.
Instead of building the object up to the 14th variable, we build it only up to the 13th, which drops Delivery_Speed. The rest of the code is unchanged, and we test for multicollinearity again on the reduced set of variables.
We will see that all the VIF values are now smaller than 5, so we can conclude that multicollinearity is in check.
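Dropping the variable and re-checking the VIF can be sketched as below. The variable is removed by name here rather than by column position, which is an illustrative choice. The data are simulated, and `manual_vif` is a hypothetical helper that computes 1 / (1 - R^2) for each predictor.

```r
# Simulated data in which Delivery_Speed is nearly collinear (an assumption).
set.seed(1)
n <- 200
Complaint_Resolution <- rnorm(n)
sim <- data.frame(
  Complaint_Resolution = Complaint_Resolution,
  Delivery_Speed       = Complaint_Resolution + rnorm(n, sd = 0.1),
  Product_Quality      = rnorm(n)
)

# Hypothetical helper: VIF of each predictor given the others.
manual_vif <- function(data, predictors) {
  sapply(predictors, function(v) {
    aux <- lm(reformulate(setdiff(predictors, v), response = v), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

all_predictors <- c("Complaint_Resolution", "Delivery_Speed", "Product_Quality")
kept <- setdiff(all_predictors, "Delivery_Speed")  # drop the collinear variable

vif_before <- manual_vif(sim, all_predictors)
vif_after  <- manual_vif(sim, kept)
print(vif_before)
print(vif_after)
```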
Final regression model building
Here FinalModel consists of the set of variables that gives the best-fitting model for the dependent variable, as judged by the AIC value. AIC plays a role similar to adjusted R-square, but in the opposite direction: the lower the AIC, the better the model. The step() function automates this search. The argument direction = "both" means that the search runs in both directions: at each step it considers adding a variable as well as removing one, and keeps the change that lowers the AIC. In the purely forward method variables are only added, one at a time, whereas in the purely backward method they are only removed. One can opt for either method separately.
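A minimal sketch of the stepwise search on simulated data. The predictors, including the deliberately irrelevant `Noise_Var`, are hypothetical; `step()` and `AIC()` are functions from base R's stats package.

```r
# Simulated training data (an assumption for illustration).
set.seed(7)
n <- 150
train <- data.frame(
  Product_Quality      = rnorm(n),
  Complaint_Resolution = rnorm(n),
  Noise_Var            = rnorm(n)  # hypothetical predictor with no real effect
)
train$Customer_Satisfaction <- 7 + 0.9 * train$Product_Quality +
  0.6 * train$Complaint_Resolution + rnorm(n, sd = 0.5)

# Start from the model with all candidate variables.
FullModel <- lm(Customer_Satisfaction ~ Product_Quality +
                  Complaint_Resolution + Noise_Var, data = train)

# Stepwise search in both directions; trace = 0 suppresses the step log.
FinalModel <- step(FullModel, direction = "both", trace = 0)

summary(FinalModel)
AIC(FullModel)
AIC(FinalModel)
```

Setting `direction = "forward"` or `direction = "backward"` runs the one-directional variants instead.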
In the output, the significant variables are marked by stars alongside their p-values. The adjusted R-square of the model is close to 0.80, and the p-value of the model is less than the alpha value, so we reject the null hypothesis in favour of the alternative hypothesis that the model contains at least one significant variable.
Checking for the accuracy of the model
The accuracy of the model can be analysed by finding the correlation between the predicted value of the customer satisfaction and the actual customer satisfaction which is already present in the dataset. We conduct the correlation test both on the training dataset and the validation dataset.
The correlation coefficient between the predicted satisfaction and the actual customer satisfaction in the validation dataset gives the analyst an idea of how accurate the model's predictions are. To judge the robustness of the result, we compare this coefficient between the training and the validation datasets. A difference of 5%-6% between the two correlation coefficients is admissible. In the report we generated, the difference is 3%, so we can conclude that our model estimates are robust.
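The comparison can be sketched as follows, again on simulated data. The predictors, the 70/30 split and the seed are assumptions, and the numbers this produces are unrelated to the 3% figure reported above.

```r
# Simulated ratings data (an assumption for illustration).
set.seed(99)
n <- 300
ratings <- data.frame(
  Product_Quality      = rnorm(n),
  Complaint_Resolution = rnorm(n)
)
ratings$Customer_Satisfaction <- 7 + 0.9 * ratings$Product_Quality +
  0.6 * ratings$Complaint_Resolution + rnorm(n, sd = 0.6)

# Random split into training and validation sets.
ratings$random <- runif(n)
train <- subset(ratings, random <= 0.7)
valid <- subset(ratings, random > 0.7)

# Fit on the training set only.
model <- lm(Customer_Satisfaction ~ Product_Quality + Complaint_Resolution,
            data = train)

# Correlation between predicted and actual satisfaction in each set.
cor_train <- cor(predict(model, newdata = train), train$Customer_Satisfaction)
cor_valid <- cor(predict(model, newdata = valid), valid$Customer_Satisfaction)

# Gap between the two coefficients, used as a rough robustness check.
gap <- abs(cor_train - cor_valid)
print(c(train = cor_train, validation = cor_valid, gap = gap))
```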
The multiple linear regression model can act as a predictive tool when the dependent variable is numeric or continuous. But what happens if the explanatory variables are a mix of categorical, binary and numeric variables and the dependent variable is dichotomous? In such a scenario, we use logistic regression.