Multicollinearity and the precision of regression coefficient estimates

At some point in your life as an analyst you may hear that multicollinearity should be avoided when possible. But many of us were never taught exactly what multicollinearity is, why it is a problem, how we can detect it, and ultimately what some of the solutions to it are. This blog post will give you answers to these questions and will show you empirically the effects of multicollinearity with a simulation in R.

What is multicollinearity?

Multicollinearity is the state of there being linear dependence among two or more predictors in a regression model (or any form of the generalized linear model, like logistic regression, ANOVA, etc.). In plain English, if your predictors are strongly correlated with one another, you’ve got multicollinearity. Typically a correlation coefficient of .8 or .9 among predictors indicates multicollinearity.
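If you want to check for it in practice, here is a minimal sketch on made-up data. It uses pairwise correlations plus variance inflation factors via the car package's vif() function (assumed to be installed); the cutoffs in the comments are common rules of thumb, not hard thresholds.

# Quick detection sketch on made-up data (not the simulation data used later)
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, 0, 0.3)    # deliberately correlated with x1
x3 <- rnorm(100)
y  <- x1 + rnorm(100)

cor(cbind(x1, x2, x3))           # look for off-diagonal correlations around .8-.9 or higher
car::vif(lm(y ~ x1 + x2 + x3))   # VIFs above roughly 5-10 are commonly treated as red flags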

Why is it a problem?

In order to understand why multicollinearity is a problem, first let’s recall the interpretation of our slope estimates in multiple regression. You want to know the effect (on average) that a 1-unit change in each of your predictor variables has on the outcome variable, holding the other predictors constant. Suppose you were modeling housing prices (Y) from two predictor variables: 1) the square footage of the house (X1) and 2) the quality of nearby schools (X2). A multiple regression model can tell us whether these variables are linearly related to the outcome. In addition, it would indicate the strength of these relationships, as represented by the size of our slopes, i.e. our regression coefficient estimates.

Multicollinearity is a problem because linear dependence among predictors causes instability in the estimates of our slope coefficients. This means the associated t-statistics and p-values for those slope estimates are dubious. In our real estate example, if you were considering adding a room to your house to increase the square footage, the best model would tell you precisely by how much you could expect the price of your house to increase. But to the extent there is multicollinearity among the predictors, our confidence in that estimate is undermined.

So where does this lack of precision come from? The nuts and bolts of it is that it all comes down to instability in the design matrix when computing the least squares regression line. Below is a set of matrices with some made-up data:

  •  A 5×5 design matrix X, with rows for 5 observations and columns for 5 parameters (the intercept X0 and four predictors X1, X2, X3, and X4)
  •  The transpose of X
  •  A 5×1 matrix Y representing the 5 observations’ responses on the outcome variable

[Figure: the design matrix X, its transpose, and the response vector Y, filled with made-up values]

The equation for calculating the slope coefficients in a linear regression model is shown below:

β̂ = (XᵀX)⁻¹ XᵀY

Basically, this equation inverts the product of X-transpose and X, and then multiplies that inverse by the product of X-transpose and Y. This is how we calculate our OLS line of best fit, with an intercept and regression coefficients. And the reason multicollinearity mucks up this formulation (for reasons we won’t detail too much) is that the inverse of a square matrix only exists if all of its columns are linearly independent. So if there is linear dependence among the predictors, it becomes impossible to carry out the inversion part of the equation.
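To make that concrete, here is a tiny throwaway example (made-up numbers, unrelated to the simulation later in this post): when one predictor is an exact linear combination of another, XᵀX is singular and solve() cannot invert it.

# Perfect collinearity makes t(X) %*% X singular
set.seed(1)
x1 <- rnorm(5)
x2 <- 2 * x1                    # x2 is an exact linear combination of x1
X   <- cbind(intercept = 1, x1, x2)
XtX <- t(X) %*% X
# solve(XtX)                    # uncommenting this throws an error: XtX is singular and cannot be inverted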

So, this is why perfect multicollinearity breaks the least squares calculation. But in the real world perfect linear dependence is rarely the issue. We usually hear about multicollinearity when predictors are not perfectly linearly dependent, but are still close to it (like correlated at .9). Here the matrix inversion can be carried out and the equation will generate slope coefficients, but those estimates end up being unstable and lacking precision (if you are interested in knowing why, check out this page). Basically, as the correlation between two predictors approaches 1, the variance of their slope coefficients approaches infinity. Practically, this means your slope estimates will have larger standard errors, which means smaller t-statistics, which means more Type II errors, i.e. failing to detect effects that exist. But instead of me just saying this, let’s show it with a simulation in R!
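Before the full simulation, a quick back-of-the-envelope check: with two correlated predictors, the standard error of each slope is multiplied by 1/sqrt(1 - r²), the square root of the variance inflation factor, where r is the correlation between the predictors.

# Standard error multiplier as the correlation between two predictors grows
r <- c(0.5, 0.8, 0.9, 0.95, 0.99)
se_multiplier <- 1 / sqrt(1 - r^2)
round(data.frame(correlation = r, se_multiplier = se_multiplier), 2)

At r = .95 the multiplier is about 3.2, which is roughly the inflation we should expect to see in the simulation below.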

Simulating the effects of multicollinearity in R

Let’s now show the effects of multicollinearity empirically in R. First we will set a repetition counter to 10,000 cycles and initialize two empty data frames: one for storing the results of a linearly independent model (without multicollinearity) and one for storing the results of a linearly dependent model (with multicollinearity). We will investigate how multicollinearity affects the p-values and standard errors of our regression slope coefficients. If our simulation supports theory, we should see the multicollinearity model yielding larger standard errors, more fluctuation in p-values, and fewer detected effects.

For each of the 10,000 iterations we will simulate two models and store the results in their respective data frames:

Model 1 (the independent-predictors model) will have 3 linearly independent predictors, with X1 moderately correlated with Y (r ≈ .45). Model 2 (the correlated-predictors model) will be the same model, but we’ll adjust X2 to be correlated with X1 at r ≈ .95 (this will inject some multicollinearity into it).

set.seed(051617)
reps = 10000


# Make two data frames for storing results
df.structure <- data.frame(x1_pval = numeric(reps), x1_se = numeric(reps),
                 x2_pval = numeric(reps), x2_se = numeric(reps),
                 x3_pval = numeric(reps), x3_se = numeric(reps),
                 cor_x1x2 = numeric(reps), cor_x1y = numeric(reps))

uncorrelated_mod_dump <- df.structure
correlated_mod_dump <- df.structure


# Make column indices for easier storage of results
p.val.col.i <- grep('pval', colnames(df.structure))
se.col.i <- grep('se', colnames(df.structure))


# Populates a results data frame with one model's output
# (predictors, outcome, and iteration index are passed in rather than read from globals)
PopulateModDump <- function(mod.sum, dump, i, x1, x2, y, p.val.col.i, se.col.i){
        dump$cor_x1y[i]  <- cor(x1, y)
        dump$cor_x1x2[i] <- cor(x1, x2)
        dump[i, p.val.col.i] <- mod.sum$coefficients[2:4, 4]   # p-values for the three slopes
        dump[i, se.col.i]    <- mod.sum$coefficients[2:4, 2]   # standard errors for the three slopes
        
        return(dump)
}


# Run the simulation 10000 times
for(i in 1:reps){
        #linearly independent predictors with x1 predicting y
        x1 <- rnorm(100, 2, 3)
        x2 <- rnorm(100, 2, 3)
        x3 <- rnorm(100, 2, 3)
        y <- x1 + rnorm(100, 5, 6)
        
        lm_uncorrelated <- summary(lm(y ~ x1 + x2 + x3))
        uncorrelated_mod_dump <- PopulateModDump(lm_uncorrelated, uncorrelated_mod_dump, i,
                                                 x1, x2, y, p.val.col.i, se.col.i)
        
        #correlate x2 with x1 and rerun model
        x2b <- x1 + rnorm(100, 0, 1)
        lm_correlated <- summary(lm(y ~ x1 + x2b + x3))
        
        correlated_mod_dump <- PopulateModDump(lm_correlated, correlated_mod_dump, i,
                                               x1, x2b, y, p.val.col.i, se.col.i)
}

Now let's check our results. First let's see what proportion of simulations detected a significant relationship between X1 and Y. Remember, this correlation was exactly the same in both models at about r = .45.
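One way to tally this, assuming a .05 significance threshold, is to compute the proportion of stored p-values below .05 in each results data frame:

mean(uncorrelated_mod_dump$x1_pval < .05)   # proportion of significant X1 slopes without multicollinearity
mean(correlated_mod_dump$x1_pval < .05)     # proportion of significant X1 slopes with multicollinearity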


With the uncorrelated model 99.65% of the simulations detected the significant relationship between X1 and Y. In the correlated (with multicollinearity) model this percentage dropped to an abysmal 33.79%.

One might say, though, that perhaps this low rate is explained simply by X2 being allocated the variance in Y that it shares with X1. If so, most of the simulations of the correlated model should pick up a significant X2-Y relationship instead. This would suggest the correlated-predictors model wasn’t necessarily performing worse overall, even if it was not detecting a significant X1 slope; perhaps it was just vacillating between X1 and X2 when selecting the variable predominantly related to Y. To investigate this, let’s see what the simulation yields for estimates across all coefficients.

For each model we’ll aggregate the results across 10,000 trials and check the X1, X2, and X3 slope coefficients’ mean p-values and standard errors. If multicollinearity were just causing fluctuation in which predictor came out as significant, we should see a combined total of ~99.65% of simulations yielding a significant slope for either X1 or X2 (matching the uncorrelated-predictors model). But if the model actually gets worse overall with multicollinearity, we should see a reduced proportion of statistically significant relationships (X1-Y and X2-Y combined). In addition, just to double check that our simulation worked as intended, we’ll check the correlations between X1 and X2 and between X1 and Y.

round(apply(uncorrelated_mod_dump, 2, mean), digits = 4)
round(apply(correlated_mod_dump, 2, mean), digits = 4)

Table 1. Aggregated Values Across 10,000 Simulations

                  Uncorrelated-predictors model    Correlated-predictors model
X1 p-value        0.0010                           0.2376
X1 S.E.           0.2039                           0.6455
X2 p-value        0.4954                           0.5010
X2 S.E.           0.2040                           0.6126
X3 p-value        0.5000                           0.4997
X3 S.E.           0.2041                           0.2041
cor(X1, X2)       -0.0002                          0.9483
cor(X1, Y)        0.4456                           0.4456
cor(X2, Y)        -0.0009                          0.4229

Remember, X1 was moderately correlated with Y at r = .45. Notice that in the uncorrelated-predictors model (no multicollinearity) the average p-value for X1 across 10,000 simulations was .001, indicating it was reliably picking up this relationship. But in the correlated-predictors (multicollinearity) model the average p-value for the X1-Y relationship jumps to .2376, reflecting a much wider range of p-values across simulations. Indeed, the SD of the X1 p-values across 10,000 simulations was .267, whereas it was just .009 in the uncorrelated-predictors model. You can also see the standard error of the X1 coefficient estimate increases by a factor of about 3. Just as theory would predict, with multicollinearity we are seeing more instability (less precision) in our estimates.
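Those SDs were not part of the aggregated output above; if you want them, a quick way to compute them from the stored results is:

round(sd(uncorrelated_mod_dump$x1_pval), 3)   # ~.009 in this run
round(sd(correlated_mod_dump$x1_pval), 3)     # ~.267 in this run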

Interestingly, for X2 and X3 the mean p-values are about .5, which indicates they are likely close to uniformly distributed (the expected distribution under the null). For X3 this makes perfect sense: across both models X3 is truly uncorrelated with Y.

But what about X2 in the correlated model? Why would its p-value also be roughly uniformly distributed? Remember, in the correlated model there is a true moderate relationship between X2 and Y (about r = .42), so we should see a large percentage of simulations detecting this relationship, yielding many p-values close to 0.

Shockingly, we see nothing near this. In fact, only 4.84% of simulations in this model detected a significant X2-Y relationship. This speaks volumes about the problematic effects of multicollinearity. Remember, in only 33.79% of simulations did the correlated-predictors model even pick up the X1-Y relationship. I hinted that maybe the remaining ~66% of cycles would detect a significant X2-Y relationship instead. But nope. Just 4.84%!

Overall this suggests that with multicollinearity the estimates for X1 and X2 are unstable, and the result is a decline in our linear regression model’s performance. Indeed, in most iterations of the simulation the estimates did not reflect that any relationship exists between the predictors and Y, despite there being two moderate relationships in the data. In addition, we see a huge spike in the variability of our estimates, with the standard errors of both X1 and X2 increasing by a factor of about 3.

Conclusions of the simulation

Overall, the model with multicollinearity almost completely missed the X2-Y relationship, picking it up only about 5% of the time, which is roughly the .05 alpha level, i.e. no better than what we’d expect for a truly null effect. And, more importantly, it missed the X1-Y relationship an abysmal ~66% of the time, which shows the inflated Type II error rate that comes with multicollinearity. This is because, compared to the linearly independent model, it loses precision on the X1 and X2 estimates by a factor of about 3 (their standard errors are roughly three times larger). Overall these results are consistent with what theory about multicollinearity would suggest: the simulation showed relative instability in the slope coefficient estimates and worse performance with (vs. without) multicollinearity.

Solutions

While you may be looking for a solution to multicollinearity, the reality is there is no guaranteed, one-size-fits-all way to handle it. The most common solution is to remove all but one predictor from the set of correlated predictors. Which predictor you keep depends on your particular problem domain, the theory you are working with, the ease of collecting data on that predictor, and so on. Another solution I recommend is simply to collect more data. The standard errors will still be poorer with (vs. without) multicollinearity, but as with almost all data work, your models should give you appreciably better precision with larger sample sizes.
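As a minimal sketch of the first option, reusing the variables left over from the last iteration of the simulation loop above, you would simply refit without one of the correlated predictors:

# Drop x2b (the predictor that is highly correlated with x1) and refit
lm_reduced <- summary(lm(y ~ x1 + x3))
lm_reduced$coefficients   # the x1 standard error shrinks back toward the uncorrelated case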

A final note

As a concluding note, I want to emphasize that the effects of multicollinearity depend, of course, on the size of the inter-predictor correlation. This simulation was run with X1 and X2 correlated at .95.

So as a final experiment, I re-ran the 10,000-cycle simulation at 40 different levels of correlation between X1 and X2. In the following two graphs you can see how the effects of multicollinearity differ at varying levels of the X1-X2 linear dependence.
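The sweep code itself isn't reproduced here, but as a sketch of how the target correlations can be hit: with sd(X1) = 3 and X2 = X1 + noise, cor(X1, X2) = 3 / sqrt(9 + s²), where s is the noise SD, so you can solve for the s that yields each target correlation (the grid below is a hypothetical example).

# Noise SDs needed to hit a grid of target X1-X2 correlations
target_r <- seq(0.15, 0.99, length.out = 40)   # hypothetical grid of target correlations
noise_sd <- 3 * sqrt(1 / target_r^2 - 1)       # from cor = 3 / sqrt(9 + s^2)
round(head(cbind(target_r, noise_sd)), 2)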

[Figure: standard error of the X1 coefficient estimate across varying levels of X1-X2 correlation]

You can see from the graph above that the effects of multicollinearity are quite drastic when X1 and X2 are strongly correlated (.8 or above), but shrink quickly at lower correlations. That said, it is worth noting that the minimum SE in this simulation (at r = .15) was still larger (SE = .2062) than that obtained in the uncorrelated model (SE = .2039, pictured as the red dot). Obviously this difference could matter in applications at scale.

In the graph below, I have plotted the proportion of cycles in which the X1-Y relationship was found to be statistically significant across varying degrees of multicollinearity. The red vertical line represents the uncorrelated-model baseline percentage (.9965). You can see the same general pattern as above: big effects of multicollinearity at r = .8 or higher that drop off appreciably at lower correlations.

[Figure: proportion of simulations detecting a significant X1-Y relationship across varying levels of X1-X2 correlation]

I hope you found this blog post informative. Feel free to tinker with the code and adjust the correlation strengths, sample sizes, etc. to test it out for yourself.
