Multicollinearity – in simple terms – is the presence of linear dependence among the predictors in a generalized linear model (such as linear regression). Basically, it occurs when any of your predictors are correlated with other predictors in the model. It should be avoided when possible. For details on why multicollinearity can wreak havoc on your inference, you can check out this post.

You may now be saying to yourself, ‘OK, so I know that multicollinearity is to be avoided, but how can I first *detect* multicollinearity?’ In this post we will spotlight 3 ways to diagnose multicollinearity in your data.

*1. Correlation matrix*

The first (but least effective) way to test for multicollinearity is to simply observe a correlation matrix of the predictors. If any of the predictors are strongly correlated with other predictors (say at .8 or .9) you have multicollinearity. In this case, you may consider dropping one or more of those variables from your model.

*Calculating the Correlation Matrix in R*

In R calculating the correlation matrix is just one line of code. You simply use the cor() function and input a matrix or data frame object of your predictors.
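As a minimal sketch, here is what that one line looks like with simulated data (the predictors x1, x2, x3 are hypothetical, constructed so that two of them are nearly collinear):

```r
# Hypothetical example: two predictors that are nearly collinear
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)   # x2 is almost a copy of x1
x3 <- rnorm(n)                  # an unrelated predictor

predictors <- data.frame(x1, x2, x3)

# One line: the pairwise correlation matrix of the predictors
round(cor(predictors), 2)
```

The x1-x2 entry will be close to 1, while the entries involving x3 will hover near 0.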

In the resulting matrix, correlations at or above .8 between predictors indicate clear multicollinearity.

**A cautionary note:** this method is quick and easy, but potentially not sensitive to real multicollinearity. It is entirely possible that no individual pairwise correlation is large, yet there is still linear dependence among 3 or more predictors. In that case, we would need better diagnostic tools. Such as…

*2. Variance Inflation Factors (VIF)*

A second (and more thorough) method is to use Variance Inflation Factors (VIFs). **These will tell you how much the variance of each coefficient is inflated.** If you recall, the standard errors of our coefficient estimates become too large when there is multicollinearity (see this post for an empirical simulation in R). So, that inflated variance is what the ‘V’ in the VIF refers to.

To calculate the VIF for each predictor, you build a separate regression model for each one, regressing that particular predictor on all the other predictors. Then you simply find the R^2 from that model, and the VIF is the ratio 1/(1 - R^2).

Now that we’ve seen how to calculate VIF, let’s think conceptually about what it represents. R^2 is the percentage of variance explained by the model, so (1 - R^2) is just the leftover (residual) variance. This means that if our predictor is linearly dependent on a combination of the other predictors, that regression model will have a high R^2, and 1/(1 - high R^2) will yield a large VIF.
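To make the formula concrete, here is a minimal sketch that computes one VIF by hand (the predictors x1, x2, x3 and the simulated data are hypothetical):

```r
# Hypothetical data: x3 depends linearly on x1 and x2
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.3)

# Regress x3 on the other predictors and pull out its R^2
r2 <- summary(lm(x3 ~ x1 + x2))$r.squared

# The VIF for x3 is 1 / (1 - R^2)
vif_x3 <- 1 / (1 - r2)
vif_x3
```

Because x3 is nearly a linear combination of x1 and x2, the R^2 is high and the resulting VIF is well above 10.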

**Hence, large values of VIF indicate multicollinearity.** Typically you want to be cautious when you see VIFs that are above 10. I personally am cautious of any VIFs of 5 or above.

*Calculating VIF in R*

In R, calculating VIF is simple. You just install the car package and use its built-in vif() function, which takes your original linear model (Y regressed on all X’s) as its sole argument.
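A minimal sketch, assuming the car package is installed (the data and variable names are hypothetical):

```r
# install.packages("car")  # if not already installed
library(car)

# Hypothetical data with one nearly redundant predictor
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.3)
y  <- 1 + x1 - x2 + rnorm(n)

# Fit the original model (y on all predictors), then pass it to vif()
model <- lm(y ~ x1 + x2 + x3)
vif(model)
```

The output is a named vector with one VIF per predictor; here x3’s VIF will be large because it is nearly a linear combination of x1 and x2.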

Now based on the VIF you can filter out variables as needed.

*3. Condition Index/Condition Number*

The third and final diagnostic tool for multicollinearity is the Condition Index (and the related Condition Number). This is my personal favorite, just because it involves some fun matrix operations. To get the condition index, first you perform an eigendecomposition on the correlation matrix of just your predictors. Then, for each eigenvalue, you find the ratio of the maximum eigenvalue to that particular eigenvalue and take the square root of that ratio. And voilà: the result is the condition index for that eigenvalue. Thus, each eigenvalue has an associated condition index value.

The condition number is simply the maximum value of the condition index. **Condition numbers between 30 and 100 indicate multicollinearity.**

*Calculating Condition Index/Condition Number in R*

To perform this in R, use the base eigen() function and perform simple calculations with the resulting eigenvalues.

- Pass in a correlation matrix to the eigen() function
- Calculate the condition index
- Calculate the condition number
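The steps above can be sketched in base R with simulated (hypothetical) predictors:

```r
# Hypothetical predictors with a strong linear dependence
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.01)

# Step 1: eigendecomposition of the predictors' correlation matrix
ev <- eigen(cor(cbind(x1, x2, x3)))$values

# Step 2: condition index = sqrt(max eigenvalue / each eigenvalue)
condition_index <- sqrt(max(ev) / ev)
condition_index

# Step 3: condition number = the largest condition index
condition_number <- max(condition_index)
condition_number
```

Note that the condition index for the largest eigenvalue is always exactly 1; it is the smallest eigenvalue that drives the condition number up.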

Here our condition number is well above 100, so it is a definite sign of multicollinearity. Again, we want the condition number to be below 30 if possible.

*Conclusion*

In sum, multicollinearity can impact our inference unfavorably. But knowing how to detect it can empower you to build a better linear model. In this post we learned 3 ways to detect multicollinearity. Ultimately having these tools in your kit will take your analytics skills to the next level. I hope that you have learned something useful from this post.

Do you have 30 seconds? If so please take some time to leave a comment below!