Introduction

Logistic regression is a regression technique in which the dependent variable is binary (dichotomous) in nature. That is, the dependent variable has two possible outcomes, determined by a set of independent or explanatory variables. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

Where do we use Logistic Regression?

Banking

A bank lends credit to its customers, but it wants to be sure about the creditworthiness of each borrower. The possible occurrences are: the borrower is a defaulter, or the borrower is not a defaulter. The outcome is binary in nature and is explained by qualitative and quantitative factors such as the borrower's income, past credit history, etc. The binary representation of the credit risk problem is:

Y=1 if a default occurs.

   =0 if no default occurs.

Churn

Suppose a mobile company wants to retain customers who are leaving the network or are moving from a high-expenditure plan to a low-expenditure plan. Here, the dependent variable takes two values:

Y=1 if there is an element of Customer Churn.

   =0 if there is no customer churn.


In all these cases, observe that 1 is assigned to the outcome that is of primary interest in the analysis.

In simple terms, if you have a dataset with a student's marks in five subjects and you have to predict the marks in another subject, it is a regression problem. On the other hand, if you have to predict whether a student will pass or fail based on the marks, it is a classification problem.

Why do we use logistic regression and not linear regression in such cases?

The answer is simple: in linear regression the dependent variable is continuous, i.e. for any given value of the independent variable the dependent variable takes a value on a continuous scale, whereas in logistic regression the dependent variable is binary. (A multinomial logistic model extends this to a polytomous, i.e. multi-category, response.) This does not necessarily mean that whenever the response or dependent variable is binary we must use logistic regression; here we need to talk about an alternative model called the linear probability model, which fits the 0/1 outcome directly with ordinary linear regression. A small sketch contrasting the two approaches follows.
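As a rough illustration (a sketch with made-up data, assuming scikit-learn and NumPy are available), the code below fits both models to the same binary outcome; the linear probability model can return fitted values below 0 or above 1, while logistic regression cannot.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up data: one predictor and a 0/1 outcome generated from a logistic curve.
rng = np.random.default_rng(0)
X = np.linspace(-4, 4, 100).reshape(-1, 1)
y = (rng.random(100) < 1 / (1 + np.exp(-2 * X[:, 0]))).astype(int)

# Linear probability model: ordinary least squares applied to the 0/1 outcome.
lpm_pred = LinearRegression().fit(X, y).predict(X)

# Logistic regression: predicted probabilities are confined to (0, 1).
logit_pred = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

print("Linear probability model range:", lpm_pred.min(), "to", lpm_pred.max())
print("Logistic regression range:     ", logit_pred.min(), "to", logit_pred.max())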

Principles behind Logistic Regression

In linear regression, given the values of Y and X1, the unknown parameters in the equation can be estimated by finding the solution for which the squared distance between the observed and predicted values of the dependent variable is minimized.
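In symbols (a sketch of the usual linear regression setup, with an error term ε that the text leaves implicit):

Y_i = \beta_0 + \beta_1 X_{1i} + \varepsilon_i, \qquad \min_{\beta_0, \beta_1} \sum_i \big(Y_i - \beta_0 - \beta_1 X_{1i}\big)^2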

In logistic regression, instead of predicting the value of a variable Y from a predictor variable X1 or several predictor variables (Xs), we predict the probability of Y occurring given known values of X1. The logistic regression equation bears many similarities to the regression equation just described. In its simplest form, when there is one predictor variable X1, the logistic regression equation from which the probability of Y is predicted is given by the equation:
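A standard reconstruction of that equation, using the same symbols, is:

P(Y) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1)}}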

These are used for defining:
• Odds of an event
• Odds ratio

Odds of an Event

The odds in favour of an event or a proposition are the ratio of the probability that the event will occur to the probability that it will not occur.

For example, the odds that a randomly chosen day of the week is a Sunday are one to six (one day out of seven is a Sunday, against six that are not). Therefore, we can define the odds as follows:
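In symbols, with the Sunday example worked in:

\text{Odds(event)} = \frac{P(\text{event})}{1 - P(\text{event})}, \qquad \text{e.g. } \frac{1/7}{6/7} = \frac{1}{6}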

Let us consider the following example to illustrate the concept of odds.

Consider the following dichotomous case: suppose the odds in favour of India winning a match are 3 to 2. These odds mean that if India plays a total of 5 matches, they would be expected to win 3 matches and lose the other two.
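Worked out with the definition above (the win probability 3/5 is implied by winning 3 of 5 matches):

P(\text{win}) = \frac{3}{5} = 0.6, \qquad \text{Odds(win)} = \frac{0.6}{1 - 0.6} = 1.5 = \frac{3}{2}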

The odds ratio is the primary measure of effect size in logistic regression. It is computed to compare the odds that membership in one group will lead to a case outcome with the odds that membership in some other group will lead to a case outcome. The odds ratio is simply the odds of being a case for one group divided by the odds of being a case for another group.

Let us consider an example to illustrate the idea of the odds ratio.
Suppose we have two groups of people: those with good eyesight and those with poor eyesight. Consider the occurrence of a road accident (Y = 1 if an accident occurs, Y = 0 otherwise):

Now, the group with good vision would have a lower probability of an accident, say P(Y = 1) = 0.4 and P(Y = 0) = 0.6. Similarly, the group with poor vision would have a much higher probability of an accident: P(Y = 1) = 0.7 and P(Y = 0) = 0.3. So,

the odds ratio is the odds for the group of people with poor vision divided by the odds for the group of people with good vision. That is:
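Using the probabilities above:

\text{OR} = \frac{0.7 / 0.3}{0.4 / 0.6} = \frac{2.33}{0.67} \approx 3.5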

The interpretation is that a person with poor vision has greater odds of facing an accident than a person with good vision. The odds can be represented generally for a logistic equation as follows:
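With the logistic equation given earlier, the odds take the standard exponential form:

\frac{P(Y)}{1 - P(Y)} = e^{\beta_0 + \beta_1 X_1}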

Taking logarithms on both sides, we have:
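That is (a reconstruction of the logged form):

\ln\!\left(\frac{P(Y)}{1 - P(Y)}\right) = \beta_0 + \beta_1 X_1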

This expression tells us that if X changes by 1 unit, the log of the odds changes by β1; that is, β1 measures the impact of a one-unit change in X on the log-odds, our transformed dependent variable.

Now, suppose we are considering whether a person will get a promotion or not. In this case, the explanatory variable is the number of working hours. Here β should tell us that if the number of working hours increases by 1 unit, i.e. if the person works for 1 extra hour, what the impact will be on the log of the odds in favour of getting a promotion.

Interpreting a change in the log of the odds is not intuitive, so to avoid this complication we stick to the odds and the odds ratio:
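For the promotion example, writing X for the number of working hours and β for its coefficient (matching the notation in the text), the odds are:

\text{Odds}(X) = e^{\beta_0 + \beta X}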

This shows the value of the odds when no increase in the working hours has been made. After incrementing the working hours by 1 hour, the odds become:
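Again a reconstruction in the same notation:

\text{Odds}(X + 1) = e^{\beta_0 + \beta (X + 1)} = e^{\beta}\, e^{\beta_0 + \beta X} = e^{\beta}\, \text{Odds}(X)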

So if the working hours change by 1 hour, the odds in favour of getting a promotion change by a factor of eβ; this multiplier, eβ, is the odds ratio associated with a one-hour increase.

From our discussions of odds-ratio we can make the following observations:
• An odds ratio of one indicates that the odds of a case outcome are the same for both groups under consideration.
• The further the odds ratio deviates from one, the stronger the relationship.
• The odds ratio has a floor of zero but no upper limit.

Methods of Logistic Regression

There are different methods that can be used in logistic regression, but here we will primarily discuss the stepwise selection methods.
The stepwise selection methods are divided into two types:
• Forward Selection Method
• Backward Selection Method

Forward Selection Method
When this method is employed, the computer begins with a model that includes only a constant and then adds single predictors to the model based on a specific criterion, called the score statistic. The variable with the most significant score statistic is added to the model, and the software proceeds until none of the remaining predictors has a significant score statistic. The software also examines the variables already in the model to see whether any should be removed.
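Below is a minimal sketch of forward selection, assuming the statsmodels, SciPy, pandas and NumPy libraries are available. The entry criterion here is a likelihood-ratio test (a close stand-in for the score statistic described above), the 0.05 threshold and the dataset are made up for illustration, and the function name forward_select is hypothetical.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2

def forward_select(X, y, alpha=0.05):
    """Add one predictor at a time, keeping an addition only if the
    likelihood-ratio test against the current model is significant."""
    selected, remaining = [], list(X.columns)
    current = sm.Logit(y, np.ones((len(y), 1))).fit(disp=0)  # constant-only model
    while remaining:
        best_p, best_var, best_fit = None, None, None
        for var in remaining:
            candidate = sm.Logit(y, sm.add_constant(X[selected + [var]])).fit(disp=0)
            lr = 2 * (candidate.llf - current.llf)   # likelihood-ratio statistic, 1 df
            p = chi2.sf(lr, df=1)
            if best_p is None or p < best_p:
                best_p, best_var, best_fit = p, var, candidate
        if best_p is not None and best_p < alpha:
            selected.append(best_var)
            remaining.remove(best_var)
            current = best_fit
        else:
            break  # no remaining predictor meets the entry criterion
    return selected, current

# Made-up illustration: three candidate predictors, two of which matter.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
y = (rng.random(200) < 1 / (1 + np.exp(-(1.5 * X["x1"] - 1.0 * X["x3"])))).astype(int)
print("Selected predictors:", forward_select(X, y)[0])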

Backward Selection Method
The process is the complete opposite of the forward selection method: the software begins with all candidate predictors in the model and removes the least significant predictor one step at a time.

Understanding the relation between the Observed and Predicted Outcomes

It is very important to understand the relation between the observed and predicted outcome. The performance of the model can be benchmarked against this relation.

Let us consider the following table:

In this table, we are working with unique observations. The model was developed for Y = Yes, so it should show a high predicted probability for observations where the real outcome was Yes and a low predicted probability for observations where the real outcome was No.

Consider observations 1 and 2. Here the real outcomes are Yes and No respectively, and the predicted probability for the Yes observation is greater than that for the No observation. Such pairs of observations are called concordant pairs. This is in contrast to observations 1 and 4: there the predicted probability for the No observation is greater than that for the Yes observation, even though the data were modelled for P(Y = Yes). Such a pair is called a discordant pair. Now consider the pair 1 and 3. The probability values are equal here, although we have opposite outcomes; this type of pair is called a tied pair. For a good model, we would expect the number of concordant pairs to be fairly high.
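To make the pair-wise comparison concrete, here is a small sketch in Python; the outcomes and probabilities below are hypothetical values chosen only to reproduce the pattern described above (the original table values are not shown in the text).

from itertools import product

# Hypothetical data: observation number, actual outcome, predicted P(Y = Yes).
observations = [
    (1, "Yes", 0.70),
    (2, "No",  0.30),
    (3, "No",  0.70),
    (4, "No",  0.90),
]

concordant = discordant = tied = 0
yes_obs = [o for o in observations if o[1] == "Yes"]
no_obs  = [o for o in observations if o[1] == "No"]

# Compare every (Yes, No) pair of observations.
for (i, _, p_yes), (j, _, p_no) in product(yes_obs, no_obs):
    if p_yes > p_no:
        concordant += 1   # Yes observation scored higher: concordant
    elif p_yes < p_no:
        discordant += 1   # No observation scored higher: discordant
    else:
        tied += 1         # equal scores: tied

print(f"Concordant: {concordant}, Discordant: {discordant}, Tied: {tied}")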

In the ideal case, all the Yes events would have very high predicted probabilities and all the No events very low ones, as shown in the left chart. In reality the picture looks more like the right chart: some Yes events have very low probabilities and some No events have very high probabilities.

Deciding on the cut-point probability level, i.e. the probability level above which the predicted outcome is treated as an event (Yes), is a fairly subjective issue. A classification table can help the researcher in deciding the cut-off level, and it involves several key concepts.

Some key terminologies associated with it are given below:
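The standard terms for a 2 × 2 classification table, presumably the ones intended here (TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives), are:
• Sensitivity (true positive rate) = TP / (TP + FN): the proportion of actual events correctly predicted as events.
• Specificity (true negative rate) = TN / (TN + FP): the proportion of actual non-events correctly predicted as non-events.
• Accuracy = (TP + TN) / (TP + TN + FP + FN): the overall proportion of correct predictions.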

Receiver Operating Characteristics Curve

Receiver operating characteristic (ROC) curves are useful for assessing the accuracy of predictions. In a ROC curve, the sensitivity is plotted as a function of 1 − specificity for different cut-off points of a parameter. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. The area under the ROC curve (AUC) is a measure of how well a parameter can distinguish between two groups. In our case the parameter is the predicted probability of the event.

A ROC plot shows:
• The relationship between sensitivity and specificity. For example, a decrease in sensitivity results in an increase in specificity.
• Test accuracy: the closer the graph is to the top and left-hand borders, the more accurate the test; the closer the graph is to the diagonal, the less accurate the test. A perfect test would go straight up from zero to the top-left corner and then straight across to the top-right corner.
• The likelihood ratio: given by the derivative (slope) of the curve at any particular cut-point.

The ROC curve shows sensitivity on the Y axis and 1 − specificity (100 − specificity when expressed in percentages) on the X axis. If predicting events (not non-events) is our purpose, then on the Y axis we have the proportion of correct predictions out of total occurrences, and on the X axis we have the proportion of incorrect predictions out of total non-occurrences, for different cut-points.
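A minimal sketch of these calculations with scikit-learn, using made-up outcomes and predicted probabilities; roc_curve returns the false positive rate (1 − specificity) and the true positive rate (sensitivity) at each cut-point, and roc_auc_score gives the area under the curve.

from sklearn.metrics import roc_curve, roc_auc_score

# Made-up observed outcomes (1 = event) and model-predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.6, 0.4, 0.3, 0.8, 0.5, 0.65, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_prob)   # 1 - specificity, sensitivity
auc = roc_auc_score(y_true, y_prob)                # area under the ROC curve

for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"cut-point {thr:.2f}: sensitivity = {t:.2f}, 1 - specificity = {f:.2f}")
print(f"AUC = {auc:.3f}")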