Confusing Stats Terms Explained: Multicollinearity

Multicollinearity, said in "plain English," is redundancy. Unfortunately, it isn't quite that simple, but it's a good place to start. Put simply, multicollinearity is when two or more predictors in a regression are highly related to one another, such that they do not each provide unique, independent information to the regression.

A common example used to demonstrate this idea is having both height and weight as predictors in a regression model. Although they represent different characteristics, both are measures of a person's size, and one is highly related to the other (taller people are likely to weigh more than shorter people, on average). Now I know you may be thinking that this isn't always true, but fashion models aside, it is generally true enough to cause problems if both are entered into a model together.

Knowing what multicollinearity generally means is one thing, but knowing why it's a problem and what to do about it is also critical. First, the consequences:

The consequences of multicollinearity are basically twofold:

  1. Loss of reliability in the estimates of the effects of individual predictors in your model (in particular, the predictors that are problematic).
  2. Findings that are strange, potentially misleading, and generally just don't make much sense.

Notice here that I specified that the estimates of the individual predictors may be less reliable. I specifically pointed to those estimates because the estimate of the overall variability explained in your dependent variable (known as R-squared) is generally not impacted by multicollinearity! Thus, if you are only interested in how much a group of predictors, taken together, is able to predict a dependent variable (DV), and don't care about how much each individual predictor is able to predict the DV, then you need not worry about multicollinearity! But if you are hoping to examine individual predictors, then please read on!
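
If you'd like to see this for yourself, here is a quick (and entirely made-up) simulation in Python, using numpy and statsmodels, that repeatedly draws samples in which the outcome depends equally on two highly correlated predictors. It's only a sketch, but notice how the R-squared barely budges from sample to sample while the individual slope estimates bounce all over the place:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, r = 50, 0.95  # sample size and the (approximate) correlation between x1 and x2

for sample in range(5):
    # Two highly correlated predictors and an outcome that depends equally on both.
    x1 = rng.normal(size=n)
    x2 = r * x1 + np.sqrt(1 - r**2) * rng.normal(size=n)
    y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

    fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

    # The overall fit (R-squared) is steady from sample to sample,
    # while the individual slope estimates swing around quite a bit.
    print(f"sample {sample}: R2 = {fit.rsquared:.2f}, "
          f"b1 = {fit.params[1]:+.2f}, b2 = {fit.params[2]:+.2f}")
```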

Let's look at an example of how multicollinearity might make the estimates of predictors less reliable and leave you generally lost and confused with your results:

Imagine for a moment a researcher named Hugo. Hugo is hoping to determine which variables predict self-reported graduate student happiness (measured with a survey that produced an interval-scale composite score). Hugo also collected other information through survey scores about the students' experiences and perceptions, two of which were "hours per week available to socialize" and "hours per week spent studying." For our example, let's pretend that these two variables were very highly (negatively) correlated (r = -.91).

To see which variable was more predictive of student happiness, Hugo entered these two predictors into a regression, along with two others that he was unsure about (age and gender), expecting to find that both the "socializing" and "studying" variables would be related to (predictive of) the graduate students' happiness, but perhaps one more than the other. To his surprise, none of the predictors appeared to be significant!
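
Hugo's data is imaginary, of course, but here is a rough sketch (again in Python, with numpy, pandas, and statsmodels, using made-up numbers and variable names of my own choosing) of what a regression like his might look like. Building "studying" almost entirely from "socialize" gives a correlation of roughly -.9, much like Hugo's:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 40  # a hypothetical sample of graduate students

# "studying" is built almost entirely from "socialize", so the two end up
# correlated at roughly r = -.9, much like Hugo's variables.
socialize = rng.normal(loc=10, scale=3, size=n)
studying = 60 - 2 * socialize + rng.normal(scale=2.5, size=n)

data = pd.DataFrame({
    "happiness": 50 + 1.5 * socialize + rng.normal(scale=6, size=n),
    "socialize": socialize,
    "studying": studying,
    "age": rng.integers(22, 35, size=n),
    "gender": rng.integers(0, 2, size=n),
})
print(data[["socialize", "studying"]].corr())  # should be close to -.9

# Hugo's model: all four predictors entered together. The model as a whole
# explains a fair chunk of the variance, yet neither "socialize" nor
# "studying" tends to look significant on its own.
hugo_model = smf.ols("happiness ~ socialize + studying + age + gender",
                     data=data).fit()
print(hugo_model.summary())
```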

Do graduate students really not care if they have time to socialize or how much time they spend studying? Obviously not. A likely explanation here is multicollinearity.

Here is what happened to our inquisitive grad student's regression:

When predictors are in a regression together, the estimate of each predictor's effect on the DV controls for the other predictors in the model. Thus, when two variables that are very highly related (and may be measuring the same or similar constructs, in this case time management) are entered into the same regression, each of their effects is estimated while controlling for a variable that explains nearly all of the variability in happiness that it would have explained itself! The result is that neither variable appears to be a significant predictor! Hugo was even more perplexed to see that the R-squared (percent of variability explained) was quite high, indicating that one or more of the predictors in the model must be a good predictor. As you might imagine, Hugo was very confused...
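
To see that mechanism in action, we can compare the "socialize" coefficient when it is fit on its own versus alongside its nearly redundant partner. This reuses the same made-up data recipe as above; the exact numbers don't matter, only the pattern of the standard error growing and the significance fading once "studying" is controlled for:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 40
socialize = rng.normal(loc=10, scale=3, size=n)
studying = 60 - 2 * socialize + rng.normal(scale=2.5, size=n)
data = pd.DataFrame({
    "happiness": 50 + 1.5 * socialize + rng.normal(scale=6, size=n),
    "socialize": socialize,
    "studying": studying,
})

alone = smf.ols("happiness ~ socialize", data=data).fit()
together = smf.ols("happiness ~ socialize + studying", data=data).fit()

# Once "studying" is controlled for, very little unique variability in
# "socialize" is left to do the predicting: its standard error is inflated
# and its t-statistic shrinks, even though R-squared stays about the same.
print("socialize alone:      SE =", round(alone.bse["socialize"], 2),
      " p =", round(alone.pvalues["socialize"], 3))
print("socialize + studying: SE =", round(together.bse["socialize"], 2),
      " p =", round(together.pvalues["socialize"], 3))
print("R-squared:", round(alone.rsquared, 2), "vs", round(together.rsquared, 2))
```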

Multicollinearity made Hugo's results both confusing and misleading. In reality, both the "socializing" and "studying" variables were strongly related to happiness (socializing slightly more so), but the multicollinearity made it appear that both were poor individual predictors of graduate student happiness. What can Hugo do to solve this problem? That is a great question.

To solve multicollinearity, you have a few options: 1) drop one of the variables that are causing the multicollinearity (check out THIS SITE for more great info on consequences) or 2) combine the variables (if they are truly representative of a shared construct; in our case, Hugo might create a "time management" variable).
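
Here is what those two options might look like in code, continuing with the same made-up data. The "time_management" composite below is just one simple, illustrative recipe (z-score both variables, reverse-score "studying," and average), not the only way to combine them:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 40
socialize = rng.normal(loc=10, scale=3, size=n)
studying = 60 - 2 * socialize + rng.normal(scale=2.5, size=n)
data = pd.DataFrame({
    "happiness": 50 + 1.5 * socialize + rng.normal(scale=6, size=n),
    "socialize": socialize,
    "studying": studying,
    "age": rng.integers(22, 35, size=n),
    "gender": rng.integers(0, 2, size=n),
})

def zscore(s):
    return (s - s.mean()) / s.std()

# Option 1: drop one of the redundant predictors (here, "studying").
dropped = smf.ols("happiness ~ socialize + age + gender", data=data).fit()

# Option 2: combine the two into a single composite. One illustrative recipe:
# z-score both, reverse-score "studying", and average them.
data["time_management"] = (zscore(data["socialize"]) - zscore(data["studying"])) / 2
combined = smf.ols("happiness ~ time_management + age + gender", data=data).fit()

print(dropped.pvalues[["socialize"]])
print(combined.pvalues[["time_management"]])
```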

In Hugo's case, he decided that spending more hours per week studying would inevitably leave one less time to socialize, so the two variables were too highly related, leading him to remove the "studying" variable. When he did, the "socializing" variable proved to be the strong predictor that he thought it would be, while the other two predictors (age and gender) remained non-significant, and the percentage of variability in the DV explained (R-squared) remained high, all of which indicated that multicollinearity was responsible for Hugo's initially strange findings.

I describe the consequences of multicollinearity with some caution, because it is typically only in severe cases that the consequences are anything to worry about. With that said, multicollinearity is not something to ignore, and you should not assume that your case is not one of the severe ones.