At some point in our lives we have all heard that multicollinearity is not good. But what exactly is it, why is it a problem, how can we detect it, and what are some solutions to it? This blog post will answer all of these questions and will also show empirically the effects of multicollinearity with a demo in R.

*What is multicollinearity?*

Multicollinearity is the state of there being linear dependence among two or more predictors in a regression model (or any form of the generalized linear model, like logistic regression, ANOVA, etc.). In plain English, if your predictors are strongly correlated with one another, you’ve got multicollinearity. Typically a correlation coefficient of .8 or .9 between predictors indicates multicollinearity.
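A quick way to check for this in R is to look at the pairwise correlation matrix of your predictors. Here is a minimal sketch with made-up data (the variable names are illustrative):

```r
# Made-up predictors: x1 and x2 are constructed to be strongly
# correlated, x3 is independent of both.
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, 0, 0.3)   # strongly correlated with x1
x3 <- rnorm(100)                # independent

predictors <- data.frame(x1, x2, x3)

# Pairwise correlations; values of ~.8 or above flag potential
# multicollinearity.
round(cor(predictors), 2)
```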

*Why is it a problem?*

In order to understand why multicollinearity is a problem, think about the goals of multiple regression. You want to know the effect (on average) that a 1-unit change in one variable (your predictor) has on another variable (your outcome). If you were predicting housing prices (Y) based on the square footage of the house (X1) and the nearby school quality (X2), a multiple regression model would tell you not just whether square footage and school quality are significantly related to price, but by how much. This “how much” question is answered by the beta (slope) coefficients in the model output, which tell us, “for every 1-unit change in X1, holding constant levels of X2, you can expect Y to change by the value of the X1 coefficient”.
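To make that concrete, here is a sketch of the housing example using simulated, made-up data (the variable names and numbers are purely illustrative, not real housing figures):

```r
# Simulate made-up housing data: price depends on square footage
# and a school-quality score.
set.seed(42)
sqft   <- rnorm(200, 1500, 300)
school <- rnorm(200, 7, 2)
price  <- 50000 + 100 * sqft + 8000 * school + rnorm(200, 0, 20000)

fit <- lm(price ~ sqft + school)

# The slope on sqft is the expected change in price for a 1-square-foot
# increase, holding school quality constant.
coef(fit)
```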

Multicollinearity is a problem because linear dependence among predictors can cause instability in the estimates of our slope coefficients. This means the resulting t-statistics and p-values for those predictors are not reliable. In our previous example with real estate, if you were considering adding a room to your house to increase the square footage, the best model would tell you *precisely* (with confidence) by how much you can expect the price of your house to increase. But to the extent there is multicollinearity in the predictors, our confidence in that estimate is undermined.

So where does this lack of estimate precision come from? In a nutshell, it comes down to instability in the design matrix when computing a least squares regression line. Below is a set of matrices and some made-up data:

- A 5×5 matrix X representing 5 observations of 5 parameters (the intercept X0 and four predictors X1, X2, X3, and X4)
- The transpose of X
- A 5×1 matrix Y representing the 5 observations’ responses on the outcome variable

The equation for calculating the slope coefficients in a linear regression model (the normal equation) is shown below:

β̂ = (XᵀX)⁻¹XᵀY

Basically, this equation inverts the product of X-transposed and X and then multiplies that inverse by the product of X-transposed and Y. This is how linear regression software formulates the line of best fit. And the reason multicollinearity will muck up this formulation (for reasons we won’t detail too much) is that the matrix XᵀX is only invertible if the columns of X are linearly independent. So if there is linear dependence among predictors, it becomes impossible to carry out the inversion part of the equation.
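To make the equation concrete, here is a small sketch with made-up data that computes the coefficients by hand via the normal equation and checks them against what lm() produces:

```r
# Made-up data with a known relationship: y = 1 + 2*x1 - x2 + noise.
set.seed(2)
x1 <- rnorm(50)
x2 <- rnorm(50)
y  <- 1 + 2 * x1 - x2 + rnorm(50)

# Design matrix with an intercept column of 1s.
X <- cbind(1, x1, x2)

# Normal equation: beta-hat = (X'X)^-1 X'y.
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# The hand-computed coefficients match lm()'s estimates.
fit <- lm(y ~ x1 + x2)
cbind(manual = as.vector(beta_hat), lm = unname(coef(fit)))
```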

So, that is the formal reason why perfect multicollinearity will negate the least squares calculation. But what about when predictors are not perfectly linearly dependent, but something close to it (like correlated at .9)? Here, the matrix inversion can occur and the equation above will generate the slope coefficients, but those estimates will end up being unstable and lacking precision (if you are interested in knowing why, check out this page).
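You can see both failure modes directly in R. This sketch (made-up data) shows solve() failing outright when one column of X is an exact multiple of another, and the condition number of XᵀX blowing up when the collinearity is merely near-perfect:

```r
set.seed(3)
x1 <- rnorm(50)

# Perfect collinearity: third column is exactly 2 * the second.
X_perfect <- cbind(1, x1, 2 * x1)

# Near collinearity: same thing plus a tiny bit of noise.
X_near <- cbind(1, x1, 2 * x1 + rnorm(50, 0, 0.01))

# X'X is exactly singular, so inversion fails with an error.
singular <- inherits(try(solve(t(X_perfect) %*% X_perfect), silent = TRUE),
                     "try-error")
singular   # TRUE

# X'X is technically invertible, but badly conditioned: a huge
# condition number means tiny changes in the data swing the estimates.
kappa(t(X_near) %*% X_near)
```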

*Simulating the effects of multicollinearity in R*

Let’s now show the effects of multicollinearity empirically in R. First we will set a repetition counter to 10,000 cycles and initialize two empty data frames, one for a linearly independent model (without multicollinearity) and one for a linearly dependent model (with multicollinearity). We will investigate how multicollinearity affects the p-values and standard errors of our regression slope coefficients. If theory is correct, we should see the linearly-dependent model yielding larger standard errors and more fluctuation in p-values.

```r
set.seed(051617)
reps <- 10000

# One row per simulation: p-values and standard errors for each slope,
# plus the sample correlations we use as sanity checks.
uncorrelated_mod_dump <- data.frame(
  x1_pval = numeric(reps), x1_se = numeric(reps),
  x2_pval = numeric(reps), x2_se = numeric(reps),
  x3_pval = numeric(reps), x3_se = numeric(reps),
  cor_x1x2 = numeric(reps), cor_x1y = numeric(reps),
  cor_x2y = numeric(reps))

correlated_mod_dump <- data.frame(
  x1_pval = numeric(reps), x1_se = numeric(reps),
  x2_pval = numeric(reps), x2_se = numeric(reps),
  x3_pval = numeric(reps), x3_se = numeric(reps),
  cor_x1x2 = numeric(reps), cor_x1y = numeric(reps),
  cor_x2y = numeric(reps))
```

For each of the 10,000 iterations we will simulate two models and store the results in their respective data frames:

Model 1 (uncorrelated-predictors model) will be a model with 3 linearly independent predictors and X1 moderately correlated to Y (r = ~.45). Model 2 (correlated-predictors model) will be the same model, but we’ll adjust X2 to be correlated with X1 at r = ~.95 (this will inject some multicollinearity in it).

```r
for (i in 1:reps) {
  # Linearly independent predictors, with x1 predicting y
  x1 <- rnorm(100, 2, 3)
  x2 <- rnorm(100, 2, 3)
  x3 <- rnorm(100, 2, 3)
  y  <- x1 + rnorm(100, 5, 6)

  lm_uncorrelated <- summary(lm(y ~ x1 + x2 + x3))
  uncorrelated_mod_dump$cor_x1y[i]  <- cor(x1, y)
  uncorrelated_mod_dump$cor_x2y[i]  <- cor(x2, y)
  uncorrelated_mod_dump$cor_x1x2[i] <- cor(x1, x2)
  uncorrelated_mod_dump$x1_pval[i]  <- lm_uncorrelated$coefficients[2, 4]
  uncorrelated_mod_dump$x1_se[i]    <- lm_uncorrelated$coefficients[2, 2]
  uncorrelated_mod_dump$x2_pval[i]  <- lm_uncorrelated$coefficients[3, 4]
  uncorrelated_mod_dump$x2_se[i]    <- lm_uncorrelated$coefficients[3, 2]
  uncorrelated_mod_dump$x3_pval[i]  <- lm_uncorrelated$coefficients[4, 4]
  uncorrelated_mod_dump$x3_se[i]    <- lm_uncorrelated$coefficients[4, 2]

  # Correlate x2 with x1 (r ~ .95) and rerun the model
  x2b <- x1 + rnorm(100, 0, 1)
  lm_correlated <- summary(lm(y ~ x1 + x2b + x3))
  correlated_mod_dump$cor_x1y[i]  <- cor(x1, y)
  correlated_mod_dump$cor_x2y[i]  <- cor(x2b, y)
  correlated_mod_dump$cor_x1x2[i] <- cor(x1, x2b)
  correlated_mod_dump$x1_pval[i]  <- lm_correlated$coefficients[2, 4]
  correlated_mod_dump$x1_se[i]    <- lm_correlated$coefficients[2, 2]
  correlated_mod_dump$x2_pval[i]  <- lm_correlated$coefficients[3, 4]
  correlated_mod_dump$x2_se[i]    <- lm_correlated$coefficients[3, 2]
  correlated_mod_dump$x3_pval[i]  <- lm_correlated$coefficients[4, 4]
  correlated_mod_dump$x3_se[i]    <- lm_correlated$coefficients[4, 2]
}
```

Now let’s check our results. First let’s see what proportion of simulations detected a significant relationship between X1 and Y. Remember, this correlation was exactly the same in both models at about r = .45.

```r
sum(uncorrelated_mod_dump$x1_pval <= .05) / reps
sum(correlated_mod_dump$x1_pval <= .05) / reps
```

With the uncorrelated model 99.65% of the simulations detected the significant relationship between X1 and Y. In the correlated (with multicollinearity) model this percentage dropped to an abysmal 33.79%.

One may say, though, that perhaps this low rate is just explained by X2 being allocated all of the variance that it shared with X1. If this were the case, most of the simulations of the correlated model should pick up a significant X2-Y relationship. This would suggest the correlated-predictors model wasn’t necessarily performing worse overall, but perhaps it was just vacillating between X1 and X2 when selecting the predominant predictor of Y.

To investigate this, for each model we’ll aggregate the results across the 10,000 trials and check the X1, X2, and X3 slope coefficients’ mean p-values and standard errors. If multicollinearity were just causing fluctuation in selecting the most significant predictor, we should see a combined total of ~99.65% of simulations yielding significant p-values. But if the model actually gets worse overall with multicollinearity, we should see a reduced proportion of statistically significant relationships (X1-Y and X2-Y combined). In addition, just to double-check that our simulation worked as intended, we’ll check the correlations between X1 and X2 and between X1 and Y.

```r
round(apply(uncorrelated_mod_dump, 2, mean), digits = 4)
round(apply(correlated_mod_dump, 2, mean), digits = 4)
```

Table 1. Aggregated Values Across 10,000 Simulations

| Model | X1 p-value | X1 S.E. | X2 p-value | X2 S.E. | X3 p-value | X3 S.E. | cor(X1,X2) | cor(X1,Y) | cor(X2,Y) |
|---|---|---|---|---|---|---|---|---|---|
| Uncorrelated-predictors | 0.0010 | 0.2039 | 0.4954 | 0.2040 | 0.5000 | 0.2041 | -0.0002 | 0.4456 | -0.0009 |
| Correlated-predictors | 0.2376 | 0.6455 | 0.5010 | 0.6126 | 0.4997 | 0.2041 | 0.9483 | 0.4456 | 0.4229 |

Remember, X1 was moderately correlated with Y at r = .45. Notice that in the uncorrelated-predictors model (no multicollinearity) the average p-value across 10,000 simulations was .001, indicating it was able to pick out this correlation. But in the correlated-predictors model (with multicollinearity) the average p-value for the X1-Y relationship jumps to .2376, suggesting a much wider range of p-values across simulations. Indeed, the SD of p-values across the 10,000 simulations was .267, whereas it was just .009 in the uncorrelated-predictors model. You can also see the standard error of the X1 coefficient estimate increase by a factor of about 3 (from .2039 to .6455). Just as theory would predict, with multicollinearity we are seeing more instability (less precision) in our estimates.

Interestingly, for X2 and X3 the mean p-values are about .5, which suggests they are roughly uniformly distributed (the expected distribution under the null). For X3 this makes perfect sense: across both models, X3 is truly uncorrelated with Y.

But what about X2 in the correlated model? Why would this p-value also be uniformly distributed? Remember in the correlated model, there is a true moderate relationship between X2 and Y (about r = .42). So we should see a large percentage of simulations detecting this relationship, yielding many p-values close to 0.

Shockingly, we don’t see anything near this. In fact, only *4.84%* of simulations in this model detected a significant X2-Y relationship. This speaks volumes about the not-so-desirable effects of multicollinearity. Remember, in only 33.79% of simulations did the correlated-predictors model even pick up the X1-Y relationship. I hinted that maybe the remaining ~66% of cycles would detect a significant X2-Y relationship. But nope. Just 4.84%!

Overall this suggests that with multicollinearity the estimates for X1 and X2 are not just unstable with regard to formulating a consistent model for predicting Y; they are in fact dragging down the performance of our linear regression model. Indeed, most of the time the estimates fail to reflect any relationship between the predictors and Y (despite there being two moderate relationships in the data). In addition, we see a huge spike in the variability of our estimates, with the standard errors of both X1 and X2 increasing by a factor of about 3.

*Conclusions of the simulation*

Overall, the model with multicollinearity completely missed the X2-Y relationship (only picking it up about 5% of the time, which is roughly the nominal Type I error rate). And more importantly, it missed the X1-Y relationship an abysmal ~66% of the time. This is because, compared to the linearly independent model, it loses precision on the X1 and X2 estimates by a factor of about 3. Overall these results are consistent with what theory and folk wisdom about multicollinearity would suggest. The simulation showed relative instability in the slope coefficient estimates and worse performance with (vs. without) multicollinearity.

*Solutions*

While you may be looking for a solution to multicollinearity, the reality is there aren’t any guaranteed 100% correct ways to handle it. The most common solution is to just remove all but one predictor in the set of correlated predictors. Which predictor you decide to keep depends on your particular problem domain, the theory you are working with, the ease of collecting data on that predictor, etc. Another solution I recommend is to simply collect more data. The standard errors will still be poorer with (vs without) multicollinearity, but as with almost all data work, your models should give you appreciably better performance with larger sample sizes.
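As a minimal sketch of the drop-a-predictor fix, reusing the simulation’s own data-generating setup (the comparison below is illustrative, not a universal result):

```r
# Same setup as the simulation: x2 built to correlate with x1 at ~.95.
set.seed(4)
x1 <- rnorm(100, 2, 3)
x2 <- x1 + rnorm(100, 0, 1)
y  <- x1 + rnorm(100, 5, 6)

# Standard error of the x1 slope with and without the collinear x2.
se_full    <- summary(lm(y ~ x1 + x2))$coefficients["x1", 2]
se_reduced <- summary(lm(y ~ x1))$coefficients["x1", 2]

# Dropping the redundant predictor shrinks the SE of the x1 estimate.
c(full = se_full, reduced = se_reduced)
```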

*A final note*

As a concluding note, I want to emphasize, of course, that the effects of multicollinearity are related to the size of the interpredictor correlation. This simulation was run with X1 and X2 correlated at .95.

So as a final experiment, I re-ran the 10,000-cycle simulation at 40 different levels of correlation between X1 and X2. In the following two graphs, you can see how the effects of multicollinearity differ at varying levels of the X1-X2 linear dependency.

You can see from the graph above that the effects of multicollinearity are quite drastic when X1 and X2 are strongly correlated (.8 or above), but drop off quickly below that. That said, it is worth noting that the minimum SE in this simulation (at r = .15) was still larger (SE = .2062) than that obtained in the uncorrelated model (SE = .2039; pictured as the red dot). While this difference seems small, it could matter in applications at scale.

In another graph (below), I have plotted the proportion of cycles in which the X1-Y relationship was determined to be statistically significant across varying degrees of multicollinearity. The red vertical line represents the uncorrelated model baseline percentage (.9965). You can see the same general pattern as above: big effects with multicollinearity at r = .8 or higher that drop off appreciably below that.
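For the curious, that sweep can be sketched roughly as follows. This is a hedged reconstruction, not the exact code behind the graphs: it uses a small grid of noise levels (rather than the 40 correlation levels above) to vary the X1-X2 correlation and records the X1 standard error at each level.

```r
# Vary the x1-x2 correlation by varying the noise SD used to build x2:
# more noise -> weaker correlation with x1.
set.seed(5)
noise_sd <- seq(0.5, 10, length.out = 8)

results <- sapply(noise_sd, function(s) {
  x1 <- rnorm(100, 2, 3)
  x2 <- x1 + rnorm(100, 0, s)
  y  <- x1 + rnorm(100, 5, 6)
  c(cor_x1x2 = cor(x1, x2),
    x1_se    = summary(lm(y ~ x1 + x2))$coefficients["x1", 2])
})

# Each column is one noise level: the x1 SE shrinks as the
# x1-x2 correlation weakens.
t(results)
```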

Overall these results are consistent with theory and common folk wisdom about multicollinearity. I hope you found this blog post informative. Feel free to tinker with the code and adjust the correlation strengths, sample sizes, etc. to test it out for yourself.

Let me know in the comments what you think! And please share any experiences that you may have had with multicollinearity!