3 Diagnostics for Multicollinearity Explained

Multicollinearity – in simple terms – is the state of having linear dependence among your predictors in a generalized linear model (such as linear regression). Basically, it occurs when any of your predictors are correlated with other predictors in the model. It should be avoided when possible. For details on why multicollinearity can wreak havoc on your inference, you can check out this post.

You may now be saying to yourself, ‘OK, so I know that multicollinearity is to be avoided, but how can I first detect multicollinearity?’ In this post we will spotlight 3 ways to diagnose multicollinearity in your data.

1. Correlation matrix

The first (but least effective) way to test for multicollinearity is to simply observe a correlation matrix of the predictors. If any of the predictors are strongly correlated with other predictors (say at .8 or .9) you have multicollinearity. In this case, you may consider dropping one or more of those variables from your model.

Calculating the Correlation Matrix in R

In R calculating the correlation matrix is just one line of code. You simply use the cor() function and input a matrix or data frame object of your predictors.

Here I’ve highlighted the correlations that indicate clear multicollinearity.

A cautionary note: this method is quick and easy, but it can miss real multicollinearity. It is entirely possible that no individual pairwise correlation is large, yet there is still linear dependence among 3 or more predictors. In that case, we need better diagnostic tools. Such as…
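To see how this can happen, here is a small sketch in Python (the post's examples are in R; this numpy version and its toy data are my own). No pairwise correlation reaches the .8 alarm level, yet the third predictor is almost exactly the sum of the first two:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two independent predictors, plus a third that is nearly their sum.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.1, size=n)

X = np.column_stack([x1, x2, x3])
corr = np.corrcoef(X, rowvar=False)

# corr(x1, x3) and corr(x2, x3) each land near .7 -- below the usual
# .8/.9 red flag -- even though x3 is an almost exact linear
# combination of x1 and x2. The correlation matrix alone misses this.
print(corr.round(2))
```

The two diagnostics that follow are built to catch exactly this situation.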

2. Variance Inflation Factors (VIF)

A second (and more thorough) method is to use Variance Inflation Factors (VIFs). These will tell you how much the variance of each coefficient is inflated. If you recall, the standard errors of our coefficient estimates become too large when there is multicollinearity (see this post for an empirical simulation in R). So, that inflated variance is what the ‘V’ in the VIF refers to.

To calculate the VIF for each predictor, you build a separate regression model for each predictor, regressing that particular predictor on all the other predictors. Then you simply find the R^2 from that model, and the VIF is the ratio 1/(1-R^2).

Now that we’ve seen how to calculate VIF, let’s think conceptually about what it represents. R^2 is the percentage of variance explained by the model, so (1-R^2) is just the leftover (residual) variance. This means that if our predictor is linearly dependent on a combination of other predictors, the regression of that predictor on the others yields a high R^2, and 1/(1 - high R^2) yields a large VIF.
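In symbols: regress predictor j on the other predictors, take that auxiliary model's R^2, and VIF_j = 1/(1 - R^2_j). Here is a rough numpy sketch of that recipe (the helper function and toy data are my own, not output from the car package):

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress the column on the remaining
    columns, take that model's R^2, and return 1 / (1 - R^2)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        # Auxiliary design matrix: intercept plus all other predictors.
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + rng.normal(scale=0.3, size=200)   # x3 nearly duplicates x1

vifs = vif(np.column_stack([x1, x2, x3]))
# x1 and x3 come back with clearly inflated VIFs; x2 sits near 1.
```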

Hence, large values of VIF indicate multicollinearity. Typically you want to be cautious when you see VIFs that are above 10. I personally am cautious of any VIFs 5 or above.

Calculating VIF in R

In R, calculating VIF is simple. You just install the car package and use its built-in vif() function. It accepts your original linear model (Y regressed on all X’s) as its sole argument.

Now based on the VIF you can filter out variables as needed.

3. Condition Index/Condition Number

The third and final diagnostic tool for multicollinearity is the Condition Index (and the related Condition Number). This is my personal favorite just because it involves some fun matrix operations. To get the condition indices, first you perform an eigendecomposition on the correlation matrix of just your predictors. Then for each eigenvalue, you find the ratio of the maximum eigenvalue to that particular eigenvalue, and take the square root of that ratio. And voilà — your result is the condition index. Thus, each eigenvalue has an associated condition index.

The condition number is simply the maximum value of the condition index. Condition numbers above 30 indicate multicollinearity, and values above 100 indicate it is severe.

Calculating Condition Index/Condition Number in R

To perform this in R, use the base eigen() function and perform simple calculations with the resulting eigenvalues.

  1. Pass in a correlation matrix of your predictors to the eigen() function
  2. Calculate the condition index for each eigenvalue
  3. Calculate the condition number (cond_num)

Here our condition number is well above 100, so it is a definite sign of multicollinearity. Again, we want our condition number to be below 30 if possible.
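The same three steps can also be sketched outside R; here is a numpy version with toy data of my own, where x2 is built to nearly duplicate x1:

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.03, size=300)   # near-perfect dependence
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])

# 1. Eigendecomposition of the predictors' correlation matrix
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))

# 2. Condition index: sqrt(max eigenvalue / each eigenvalue)
cond_index = np.sqrt(eigvals.max() / eigvals)

# 3. Condition number: the largest condition index
cond_num = cond_index.max()
# With x2 almost equal to x1, cond_num lands well above the 30 cutoff.
```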


In sum, multicollinearity can impact our inference unfavorably. But knowing how to detect it can empower you to build a better linear model. In this post we learned 3 ways to detect multicollinearity. Ultimately having these tools in your kit will take your analytics skills to the next level. I hope that you have learned something useful from this post.

Do you have 30 seconds? If so please take some time to leave a comment below!


A Data Science Technique for Cleaner Data Preparation

As data scientists, so much of our time is spent on cleaning and preparing our data. The analysis and model building may be the most fun and insightful part, but before we can do that, we have to ensure that we have quality data coming in.

What I have seen in my experience is that most statisticians and analysts do all of their cleaning, prep work, and the analysis itself in a single script that reads from top to bottom. While this is certainly one option for how we do our work, it often leads to messy, unmaintainable code that is difficult to communicate to other members of the team. A key goal of any team of analysts/programmers is learning to communicate well. And if code is our primary language, we really ought to make sure that we are coding with the best data cleaning practices.

In my job as a data scientist, I have had the privilege of working with more senior developers who have passed on lots of great lessons to me. I’ve applied some of these practices to my life as an analyst, in particular, regarding how I clean and prepare data. Today I am going to pass on one of those lessons to you. Namely, we’ll see how building a class in Python can help to organize our data preparation.

First we’ll explore how most analysts would perform a data cleaning process. Then we’ll observe a better way.

Method 1 (BAD): Writing a Single Process in a Script for Data Preparation

The following example shows how the typical analyst might prep data. This method is faster to code in the short-run, but will be slower and more cumbersome in the long-run. 

import pandas as pd

# Data Cleaning

# Create data frame
df = pd.DataFrame({
        'string_col': ['A', 'B', 'C'],
        'bool_col1': [False, True, True],
        'int_col': [1, 2, 3],
        'bool_col2': [True, True, False]})

# Set boolean variables to process
vars_to_process = list(df.columns[df.dtypes == 'bool'])

# If there are variables to process, do some processing.
# Here, we just set the variables to 12345 for illustration purposes.
if len(vars_to_process) > 0:
    df[vars_to_process] = 12345

# Set a sorting column
sorting_col = 'string_col'

# If there is a sorting column, prep it.
if sorting_col is not None:
    # To prep our sorting column, we lowercase values and then triple them
    df[sorting_col] = [x.lower() for x in df[sorting_col]]
    df[sorting_col] = [x * 3 for x in df[sorting_col]]

Notice, the code above does work. And to most people that is good enough. It is certainly not a bad start. Most analysts and statisticians would see code like this and wouldn’t bat an eye.

However, notice a few issues with it:

1) The sorting column (sorting_col) is just kind of hanging out in the middle of the script. If one wanted to change that value in the future or use it again later in the script, they’d have to navigate through the code to find that global variable before they could reinitialize it. They’d also have to remind themselves of what it represented, which requires reading it again. This would be quite lengthy and annoying. And it would definitely make the program more vulnerable to human error. This is true even for the person who wrote the script, but especially for other members on the team who may be reading the code.

2) Also, note that no part of the above process was encased in a function. This means that if an error occurred, we’d have no good way of handling it or debugging our process. In addition, it requires people to know that checking “if the length of vars_to_process is greater than 0” means checking whether the list of variables is empty. It would be nicer to spell this out with the code itself rather than having to add a comment manually.

3) Finally, if we wanted to use the sorting column variable later in the analysis, we may forget that it belongs to the data frame or is a meaningful attribute of it. To that end, it would be nice to have a way to keep it conceptually “linked” to the data frame.

In sum, though it may be faster to do our analysis like this, in the long-term these problems that we have identified could lead to lost time and work.

Now we’ll look at a technique that can make our code more maintainable, less prone to error, and easier to understand.

Method 2 (GOOD): Writing a Class for Data Preparation

The following example shows an improvement on our previous method. This shows the value of being a statistician who can think like a developer (i.e. this makes for better data scientists). With re-use, this method will yield long-term returns on your time and productivity.

class DataPrepper:

    def __init__(self, df, vars_to_process=None, sorting_col=None):
        self.df = df
        self.vars_to_process = vars_to_process if vars_to_process is not None else []
        self.sorting_col = sorting_col

    def there_are_vars_to_process(self):
        return len(self.vars_to_process) > 0

    def process_vars(self):
        """To process a variable, set each value to 12345."""
        self.df[self.vars_to_process] = 12345

    def there_is_a_sorting_col(self):
        return self.sorting_col is not None

    def process_sorting_col(self):
        """To process a sorting column, first lowercase values and then triple them."""
        self.lowercase_vals()
        self.triple_vals()

    def lowercase_vals(self):
        lower_vals = [x.lower() for x in self.df[self.sorting_col]]
        self.df[self.sorting_col] = lower_vals

    def triple_vals(self):
        triple_vals = [x * 3 for x in self.df[self.sorting_col]]
        self.df[self.sorting_col] = triple_vals

Notice we now have a class with variables and functions that are centered on data preparation. And note how it has solved each of our problems from before:

  1. It has a few instance variables, such as sorting_col, that we can use inside any functions in our process. In addition, per convention, they are located at the top of the class, meaning other programmers can easily access them if they need to.
  2. We also can see how these functions make our process much more readable and maintainable. Other members of the team can read this and generally understand the nature of the process. The code now reads almost like a story (e.g., ‘If there are variables to process, then process the variables.’). By the way, I picked up this tip from reading Robert Cecil Martin’s “Clean Code” (which I highly recommend). And now where applicable I always program with this level of readability.
  3. Finally, because the sorting column (sorting_col) is stored as an attribute of the prepared data, if we ever do need it in the future, we can access that attribute easily and know that it is an associated component of the data preparation process.

For each of these reasons, our code is now less prone to error, more maintainable, and easier to communicate to our team.

Now preparing the data is as easy as creating a DataPrepper object.

prepped = DataPrepper(df=df,
                      vars_to_process=['bool_col1', 'bool_col2'],
                      sorting_col='string_col')
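If you would like to run the whole pattern end-to-end, here is a condensed, self-contained version of the class and its invocation. The prep() driver method is my own addition, tying the steps together in the "reads like a story" style discussed above:

```python
import pandas as pd

class DataPrepper:
    """Bundles a data frame with its prep parameters and prep steps."""

    def __init__(self, df, vars_to_process=None, sorting_col=None):
        self.df = df
        self.vars_to_process = vars_to_process if vars_to_process else []
        self.sorting_col = sorting_col

    def there_are_vars_to_process(self):
        return len(self.vars_to_process) > 0

    def process_vars(self):
        """For illustration, set each flagged column to 12345."""
        self.df[self.vars_to_process] = 12345

    def there_is_a_sorting_col(self):
        return self.sorting_col is not None

    def process_sorting_col(self):
        """Lowercase the sorting column's values, then triple them."""
        col = self.sorting_col
        self.df[col] = [x.lower() * 3 for x in self.df[col]]

    def prep(self):
        # The process reads like a story.
        if self.there_are_vars_to_process():
            self.process_vars()
        if self.there_is_a_sorting_col():
            self.process_sorting_col()
        return self.df

df = pd.DataFrame({'string_col': ['A', 'B', 'C'],
                   'bool_col1': [False, True, True],
                   'int_col': [1, 2, 3],
                   'bool_col2': [True, True, False]})

prepped = DataPrepper(df=df,
                      vars_to_process=['bool_col1', 'bool_col2'],
                      sorting_col='string_col').prep()
```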


Concluding notes

As an analyst, there is so much more to our job than just crunching numbers and reporting results. Spending more time initially to develop better data preparation techniques will pay long-term dividends to you, your team, and ultimately your organization.

Note: while the above technique improved our process, of course there are additional enhancements that we can make. In future posts, we will learn about the value of using try/catch blocks, using configuration files to specify our parameters, and some conventions for naming variables and functions. All in all I will continue to spotlight great tips that will make you a better data scientist!

Linear Regression Simplified: A Visual Tutorial with a Lab in R

Linear regression is perhaps the most widely used statistical tool in data science. It is a fundamental building block for statistical inference and prediction, and many more advanced methods derive from simple linear regression. This blogpost will walk you through the basic concept of linear regression. It will also provide a code tutorial to perform a linear regression in R.

By the end of this blogpost you will know:

  1. What linear regression is
  2. How linear regression works
  3. How to perform linear regression in R and how to interpret its output

What is linear regression?

Linear regression is a statistical method that models the relationship between two quantitative variables.

As you can see below, we are investigating the relationship between car weight and gas mileage. It appears that a downward sloping line describes this relationship. Generally, as car weight increases, gas mileage decreases.


Now, just eye-balling this graph, one can see that the line does a pretty good job of capturing the overall downward “trend”. However, it is also clearly an imperfect model of the data (as all models by definition are). For example, many points are above the line and many are also below it. In fact, the line doesn’t pass through most of the data points. This means there is also some “scatter” or “error” about the line.

The goal of linear regression is to fit a line to the data that minimizes the scatter.


How does linear regression work?

Linear regression works by finding a line of best fit to pass through the data. Infinitely many lines can be fit to the data, but the best-fitting line (called the “line of best fit”) will be the one that minimizes scatter. The metric that quantifies scatter is called the sum of squared residuals (aka sum of squared error or SSE).

Each data point has a corresponding “residual,” which is just a quantity representing the vertical distance between it and the line of best fit. In the plot below all the residuals are demarcated by black lines. You can see each residual represents how far away the data point is from its estimated point on the line. In other words, the residual represents how much ‘error’ that data point contributes to the model.


A goal of linear regression is to minimize the overall model error. This should make intuitive sense. The less error there is in a model, the better it is fitting to the data. Therefore, we want to treat both positive residuals (those above the line) and negative residuals (those below the line) as equally ‘bad’. So we square each residual value and sum them up. The line of best fit is the line that minimizes the sum of these squared residuals.
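To make "sum of squared residuals" concrete, here is a tiny sketch with made-up points and two candidate lines (not the mtcars data): the line that follows the trend earns a much smaller SSE than one that ignores it.

```python
# Score two candidate lines by their sum of squared residuals (SSE).
weights = [1.5, 2.5, 3.5, 4.5]       # predictor (weight, in 1000s of lbs)
mpgs    = [29.0, 24.0, 18.0, 13.0]   # outcome (gas mileage)

def sse(intercept, slope):
    """Sum of squared vertical distances between points and a line."""
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(weights, mpgs))

good_fit = sse(37.3, -5.4)   # a line close to the downward trend
bad_fit  = sse(30.0, -2.0)   # a flatter line that ignores the trend
# good_fit comes out far smaller than bad_fit, so the first line wins.
```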

You can see the line of best fit above for estimating gas mileage in mpg is calculated to be: 37.3 − 5.34 * car weight (in 1000’s of pounds).

This equation tells us that any car’s gas mileage (mpg) is predicted to be 37.3 − 5.34 * its weight in thousands of pounds. So if a car weighs 3000 pounds, we would estimate its mpg to be (37.3 − 5.34*3) = 21.28.

You may be wondering how the linear regression model settled on that equation. Under the hood, the solution involves matrix algebra and other fancy inversion operations. We won’t go into these details here, but there is a closed-form analytic solution. Luckily in practice computers will perform this for us, as it is painful to do by hand.
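For the curious, the closed-form solution is b = (X'X)^-1 X'Y, where X is the design matrix (a column of 1s plus the predictors). Here is a minimal numpy sketch of it, using made-up, noise-free data so the normal equations recover the generating coefficients exactly:

```python
import numpy as np

# Noise-free toy data generated from mpg = 37.3 - 5.34 * weight.
wt = np.array([1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
mpg = 37.3 - 5.34 * wt

# Design matrix: a column of 1s (for the intercept) plus the predictor.
X = np.column_stack([np.ones_like(wt), wt])

# b = (X'X)^-1 X'Y; solve() avoids forming the inverse explicitly.
b = np.linalg.solve(X.T @ X, X.T @ mpg)
# b[0] recovers the intercept (37.3) and b[1] the slope (-5.34).
```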

Now let’s learn how to build our own linear regression model in R. We’ll walk through it step-by-step.

Applying Linear Regression: A Lab

For this lab we will use the stock data set mtcars in R. Load up the data and let’s check out the variables.



As you can see, this data set contains information about cars, including their miles per gallon (mpg), weight (wt), number of cylinders (cyl), etc.

We will build a simple linear regression model that predicts mpg with wt.

mod <- lm(mpg ~ wt, data = mtcars)

Notice we have to specify three parameters:

  1. The outcome variable to be predicted (mpg; sometimes called the dependent variable)
  2. The predictor variable (wt; sometimes called the independent variable)
  3. The data set (in this case mtcars).

Now let’s check the results!



You can see the summary statistics from our model above.

The Estimate column gives us the estimates of our intercept and slopes on our line of best fit. Recall from earlier that our line of best fit for predicting mpg from wt was given by 37.3 – 5.34*wt.

Regarding the intercept, it is mostly just there to fit the line. It technically means ‘the estimated value of Y when X = 0’, but in most regression contexts that doesn’t translate into anything of theoretical interest. Sometimes it flat out doesn’t even make sense. For example, a car weight of 0 is impossible, and intercepts can even come out negative for outcomes that can’t be negative; they often show up just to position the line.

The interpretation for the predictor slope coefficient is that for every 1-unit increase in car weight (in this context 1 unit =1000 pounds), that car’s gas mileage decreases by 5.34 mpg. It decreases because the coefficient is negative. If it were positive, the translation would change to “increases.”

The Standard Error column tells us how much precision we had in estimating the respective terms. Generally, wider scatter means a higher standard error. This makes sense visually. See the two models below:


The model in the left pane is the model we just fit. The model in the right pane is a model I just made up (which is just wt predicting 3*mpg). You can see there are larger residuals in this second model. This model therefore has less precision in its estimate of the line of best fit, and consequently higher standard error. Generally, you want the data points tightly hugging the line (like what can be seen in the left pane). When data points are further away from the line it indicates less precision.

So standard error is a quantity representing the precision in our estimates. So you don’t have to scroll, here is the same table of summary statistics as above:


The t-value gives us a test statistic for our intercept and slope estimates. It represents how many standard errors away from 0 that obtained value is on a null distribution. Though we have to sacrifice some detail here to make this explanation intuitive, you can crudely think of the t-value as a measurement of the size of the effect (in particular, the size of the effect in terms of standard errors).

A more intuitive metric to understand (and one directly related to the t-statistic) is the p-value. The p-value basically tells us how ‘surprising’ our obtained results are, or how strong the evidence of a real effect is. We generally want small p-values. P-values below .05 indicate that an effect is statistically significant. Such a p-value can be translated as ‘there is less than a 5% chance of obtaining the results we did if there were truly no effect.’ In our output above, the p-value column indicates that wt is significantly related to mpg.

Notice too that both Residual standard error and Multiple R-squared are metrics for judging how well our line fits the data. Generally, lower residual standard error and higher R-squared indicate a better fit to the data. R-squared tells you what proportion of the overall variance is explained by the model. The mathematics are pretty simple. You can look them up in your spare time.
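For reference, the arithmetic behind both metrics is short; here is a hand-rolled sketch with made-up numbers (not the mtcars output):

```python
import math

# Made-up data and a pretend fitted line: y = 0 + 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = 0.0, 2.0

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
sse = sum(r ** 2 for r in residuals)        # sum of squared residuals

y_mean = sum(y) / len(y)
sst = sum((yi - y_mean) ** 2 for yi in y)   # total variation in y

r_squared = 1 - sse / sst                   # proportion of variance explained
# n - 2 in the denominator: two estimated parameters (intercept, slope)
resid_std_error = math.sqrt(sse / (len(x) - 2))
```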

That sums up the output in R. Another cool trick in R Studio is to type the $ symbol after your model object. It will generate a list of summary statistics from your model that you can explore further.


If you want to check out the coefficients, residuals, fitted values, etc., you can do that.

This can really come in handy if you want to use any of these values elsewhere in your analysis. For example, if you need to use the sum of squared residuals, you can do that by typing:



Congratulations! You are now officially educated in linear regression! You know what linear regression is, what it is used for, how it works, and also how to do it in R.

Multicollinearity and the precision of regression coefficient estimates

At some point in your life as an analyst you may hear that multicollinearity should be avoided when possible. But many of us were not taught exactly what multicollinearity is, why it is a problem, how we can detect it, and ultimately what some solutions to it are. This blog post will give you answers to these questions and will show you empirically the effects of multicollinearity with a simulation in R.

What is multicollinearity?

Multicollinearity is the state of there being linear dependence among two or more predictors in a regression model (or any forms of the generalized linear model like logistic regression, ANOVA, etc.). In plain English, if your predictors are strongly correlated with one another, you’ve got multicollinearity. Typically a correlation coefficient of .8 or .9 among predictors indicates multicollinearity.

Why is it a problem?

In order to understand why multicollinearity is a problem, first let’s recall the interpretation of our slope estimates in multiple regression. Each slope tells you the effect (on average) that a 1-unit change in that predictor has on the outcome, holding the other predictors constant. Suppose you were modeling housing prices (Y) from two predictor variables: 1) the square footage of the house (X1) and 2) the nearby school quality (X2). A multiple regression model can tell us whether these variables are linearly related to the outcome. In addition, it would indicate the strength of these relationships, as represented by the size of our slopes, i.e. our regression coefficient estimates.

Multicollinearity is a problem because having linear dependence among predictors can cause instability in the estimates of our slope coefficients. This means the associated t-statistics and p-values for those slope estimates are dubious. In our previous example with real estate, if you were considering adding a room to your house to increase the square footage, the best model would tell you precisely by how much you could expect the price of your house to increase. But to the extent there is multicollinearity in the predictors, our confidence in that estimate is undermined.

So where does this lack of estimate precision come from? The nuts and bolts of it is that it all has to do with instability in the design matrix when computing a least squares regression line. Below is a set of matrices and some made up data:

  •  A 5×5 matrix X representing 5 observations of 5 parameters (the intercept X0 and four predictors X1, X2, X3, and X4)
  •  The transpose of X
  •  A 5×1 matrix Y representing the 5 observations’ responses on the outcome variable


The equation for calculating the slope coefficients in a linear regression model is shown below, where X' denotes the transpose of X:

b = (X'X)^-1 (X'Y)

Basically you can see this equation inverts the product of X-transposed and X, and then multiplies it by the product of X-transposed and Y. This is how we calculate our OLS line of best fit with an intercept and regression coefficients. And the reason why multicollinearity will muck up this formulation (for reasons we won’t detail too much) is that the inverse of that square matrix can only exist if all the columns of X are linearly independent. So if there is linear dependence, it becomes impossible to carry out that inversion part of the equation.

So, this is why perfect multicollinearity will negate the least squares calculation. But in the real world perfect linear dependence is rarely an issue. Yet we still hear about multicollinearity, usually when predictors are not perfectly linearly dependent but are still something close to it (like correlated at .9). Here, the matrix inversion can occur, and the equation will generate the slope coefficients, but those estimates will end up being unstable and lacking precision (if you are interested in knowing why, check out this page). Basically, as the correlation of two variables goes to 1, the variance of their slope coefficients approaches infinity. Practically, this means your slope estimates will have higher standard errors, which means smaller t-statistics, which means many Type II errors, or failing to detect an effect that exists. But instead of me just saying this, let’s show it with a simulation in R!
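Before the full simulation, the mechanism can be previewed directly: the diagonal of (X'X)^-1 scales the coefficient variances, and it explodes as two predictors approach linear dependence. A quick numpy sketch with toy data of my own:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)

def coef_var_scale(x2):
    """Diagonal of (X'X)^-1 for the design [1, x1, x2]; these entries,
    times the error variance, give the coefficients' sampling variances."""
    X = np.column_stack([np.ones(n), x1, x2])
    return np.diag(np.linalg.inv(X.T @ X))

indep = coef_var_scale(rng.normal(size=n))                      # x2 independent of x1
collinear = coef_var_scale(x1 + rng.normal(scale=0.1, size=n))  # r ~ .995

# The x1 slope's variance term is dramatically larger when x2 is a
# near-copy of x1, even though x1 itself has not changed.
```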

Simulating the effects of multicollinearity in R

Let’s now show the effects of multicollinearity empirically in R. First we will set a repetition counter to 10,000 cycles and initialize two empty data frames: 1) for storing the results of a linearly independent model (without multicollinearity) and 2) for storing the results of a linearly dependent model (with multicollinearity). We will investigate how multicollinearity affects the p-values and standard errors of our regression slope coefficients. If our simulation supports theory, we should see the multicollinearity model yielding larger standard errors, more fluctuation in p-values, and fewer detected effects.

For each of the 10,000 iterations we will simulate two models and store the results in their respective data frames:

Model 1 (independent-predictors model) will be a model with 3 linearly independent predictors and X1 moderately correlated to Y (r = ~.45). Model 2 (correlated-predictors model) will be the same model, but we’ll adjust X2 to be correlated with X1 at r = ~.95 (this will inject some multicollinearity in it).

reps = 10000

# Make two data frames for storing results
df.structure <- data.frame(x1_pval = numeric(reps), x1_se = numeric(reps),
                 x2_pval = numeric(reps), x2_se = numeric(reps),
                 x3_pval = numeric(reps), x3_se = numeric(reps),
                 cor_x1x2 = numeric(reps), cor_x1y = numeric(reps))

uncorrelated_mod_dump <- df.structure
correlated_mod_dump <- df.structure

# Make column indices for easier storage of results
p.val.col.i <- grep('pval', colnames(df.structure))
se.col.i <- grep('se', colnames(df.structure))

# Populates data frames with model output
PopulateModDump <- function(mod.sum, dump, x2, p.val.col.i, se.col.i){
        dump$cor_x1y[i] = cor(x1, y)
        dump$cor_x1x2[i] = cor(x1, x2)
        dump[i, p.val.col.i] = mod.sum$coefficients[,4][2:4]
        dump[i, se.col.i] = mod.sum$coefficients[,2][2:4]
        return(dump)
}

# Run the simulation 10000 times
for(i in 1:reps){
        # linearly independent predictors with x1 predicting y
        x1 <- rnorm(100, 2, 3)
        x2 <- rnorm(100, 2, 3)
        x3 <- rnorm(100, 2, 3)
        y <- x1 + rnorm(100, 5, 6)
        lm_uncorrelated <- summary(lm(y ~ x1 + x2 + x3))
        uncorrelated_mod_dump <- PopulateModDump(lm_uncorrelated, uncorrelated_mod_dump,
                                                 x2, p.val.col.i, se.col.i)
        # correlate x2 with x1 and rerun model
        x2b <- x1 + rnorm(100, 0, 1)
        lm_correlated <- summary(lm(y ~ x1 + x2b + x3))
        correlated_mod_dump <- PopulateModDump(lm_correlated, correlated_mod_dump,
                                               x2b, p.val.col.i, se.col.i)
}

Now let's check our results. First let's see what proportion of simulations detected a significant relationship between X1 and Y. Remember, this correlation was exactly the same in both models at about r = .45.


With the uncorrelated model 99.65% of the simulations detected the significant relationship between X1 and Y. In the correlated (with multicollinearity) model this percentage dropped to an abysmal 33.79%.

One may say, though, that perhaps this low rate is just explained by X2 being allocated all of the variance that it shared with X1. If so, most of the simulations of the correlated model should pick up a significant X2-Y relationship. This would suggest the correlated-predictors model wasn’t necessarily performing worse overall, even if it was not detecting a significant X1 slope. Perhaps it was just vacillating between X1 and X2 when selecting the variable that was predominantly related to Y. So to investigate this, let’s see what the simulation yields for estimates across all coefficients.

For each model we’ll aggregate the results across 10,000 trials and check the X1, X2, and X3 slope coefficients’ mean p-values and standard errors. If multicollinearity was just causing fluctuation in selecting the most significant predictor, we should see a combined total of ~99.65% of simulations yielding significant p-values (a match to the uncorrelated-predictors model). But if the model actually gets worse overall with multicollinearity, we should see a reduced proportion of statistically significant relationships (X1-Y and X2-Y combined). In addition, just to double-check that our simulation worked as intended, we’ll check the correlations between X1 and X2 and between X1 and Y.

round(apply(uncorrelated_mod_dump, 2, mean), digits = 4)
round(apply(correlated_mod_dump, 2, mean), digits = 4)

Table 1. Aggregated Values Across 10,000 Simulations

                               X1 p-value  X1 S.E.  X2 p-value  X2 S.E.  X3 p-value  X3 S.E.  cor(X1,X2)  cor(X1,Y)  cor(X2,Y)
Uncorrelated-predictors Model      0.0010   0.2039      0.4954   0.2040      0.5000   0.2041     -0.0002     0.4456    -0.0009
Correlated-predictors Model        0.2376   0.6455      0.5010   0.6126      0.4997   0.2041      0.9483     0.4456     0.4229

Remember X1 was moderately correlated to Y at r=.45. Notice in the uncorrelated-predictor model (no multicollinearity) the average p-value across 10,000 simulations was .001, indicating it was able to pick up this linear dependence. But in the correlated-predictors (multicollinearity) model the average p-value for the X1-Y relationship jumps to .2376, suggesting a wider range of outcomes/p-values. Indeed, the SD of p-values across 10,000 simulations was .267, whereas it was just .009 in the uncorrelated-predictors model. You can also see the standard error of the X1 coefficient estimate increases by a factor of 3. Just as theory would predict with multicollinearity we are seeing more instability (less precision) in our estimates.

Interestingly, too, for X2 and X3 the mean p-values are about .5. This indicates they are likely uniformly distributed (the expected distribution under the null). For X3 this makes perfect sense: across both models X3 is truly uncorrelated with Y.

But what about X2 in the correlated model? Why would this p-value also be uniformly distributed? Remember in the correlated model, there is a true moderate relationship between X2 and Y (about r = .42). So we should see a large percentage of simulations detecting this relationship, yielding many p-values close to 0.

Shockingly, we don’t see anything near this. In fact, only 4.84% of simulations in this model detected a significant X2-Y relationship. This simulation speaks volumes about the problematic effects of multicollinearity. Remember, in only 33.79% of simulations did the correlated-predictors model even pick up the X1-Y relationship. I hinted that maybe the remaining ~66% of cycles would detect a significant X2-Y relationship. But nope. Just 4.84%!

Overall this suggests that with multicollinearity the estimates of X1 and X2 are unstable and result in a decline in our linear regression model performance. Indeed, in most iterations of the simulation, the estimates were not reflecting that any relationship exists between predictors and Y (despite there being two moderate relationships in the data). In addition, we see a huge spike in the variability of our estimates, with standard errors of both X1 and X2 increasing by a factor of 3.

Conclusions of the simulation

Overall, the model with multicollinearity completely missed the X2-Y relationship (only picking it up about 5% of the time, roughly the Type I error rate). And more importantly, it missed the X1-Y relationship an abysmal ~66% of the time. That shows the inflated Type II error rate under multicollinearity. This is because, compared to the linearly independent model, it loses precision on the X1 and X2 estimates by a factor of 3. Overall these results are consistent with what theory about multicollinearity would suggest. The simulation showed relative instability in the slope coefficient estimates and worse performance with (vs without) multicollinearity.


While you may be looking for a solution to multicollinearity, the reality is there is no guaranteed way to handle it. The most common solution is to remove all but one predictor in the set of correlated predictors. Which predictor you keep depends on your particular problem domain, the theory you are working with, the ease of collecting data on that predictor, etc. Another solution I recommend is simply to collect more data. The standard errors will still be poorer with (vs. without) multicollinearity, but as with almost all data work, your models should give you appreciably better performance with larger sample sizes.

A final note

As a concluding note, I want to emphasize that the effects of multicollinearity depend on the size of the interpredictor correlation. This simulation was run with X1 and X2 correlated at .95.

So as a final experiment, I re-ran the 10,000-cycle simulation at 40 different levels of correlation between X1 and X2. In the two graphs below, you can see how the effects of multicollinearity differ at varying levels of the X1-X2 linear dependency.
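That sweep can be sketched roughly as follows (here with far fewer cycles per correlation level than the actual 10,000, purely to keep the run time down, and with assumed effect sizes):

```r
set.seed(2)
n <- 100
rhos <- seq(.15, .99, length.out = 40)   # 40 levels of X1-X2 correlation

mean_se_x1 <- sapply(rhos, function(r) {
  ses <- replicate(200, {
    x1 <- rnorm(n)
    x2 <- r * x1 + sqrt(1 - r^2) * rnorm(n)
    y  <- .45 * x1 + .45 * x2 + rnorm(n)
    summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
  })
  mean(ses)   # average SE of the X1 estimate at this correlation level
})

plot(rhos, mean_se_x1,
     xlab = "X1-X2 correlation", ylab = "Mean SE of the X1 estimate")
```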


You can see from the graph above that the effects of multicollinearity are quite drastic when X1 and X2 are strongly correlated (.8 or above), but shrink quickly below that level. That said, it is worth noting that the minimum SE in this simulation (at r = .15) was still larger (SE = .2062) than that obtained in the uncorrelated model (SE = .2039; pictured as the red dot). This difference could matter in applications at scale.

In another graph (below), I have plotted the proportion of cycles in which the X1-Y relationship was found to be statistically significant across varying degrees of multicollinearity. The red vertical line represents the uncorrelated-model baseline percentage (.9965). You can see the same general pattern as above: big effects of multicollinearity at r = .8 or higher that drop off appreciably below that level.


I hope you found this blogpost informative. Feel free to tinker with the code and adjust the correlation strengths, sample sizes, etc. to test it out for yourself.

Decision Tree Overview: what they are, how they work, when they are useful, and why I like them

One of the most invigorating aspects of the data science community is just the sheer number of available algorithms that you can add to your analytics toolkit. With all of these techniques at your disposal, you gain much flexibility in your data work and you can be confident that you will use an appropriate method when the scenario calls for it.

If your background is like mine (a more traditional inferential statistics education), you’ve likely encountered just a fraction of possible ways to model your data. But rest assured, if you’ve already learned the basic fundamentals of statistics, you should be able to ease your way into more complicated methods fairly seamlessly.

To start, it’s important to acknowledge the most common statistical techniques are just extensions of the general linear model (GLM). These include logistic regression, linear regression, and ANOVA (which itself is just a special case of linear regression). These methods are popular and for good reason; when a relationship between X and Y is truly linear (which it often is), they tend to perform extremely well.

But there are many situations in which X and Y are related, but in a non-linear way. What is there to do in such cases? Well, one technique is to simply modify the terms of the linear regression equation (using splines, segmenting the regression, adding polynomial terms, including interaction terms, etc.). But these models often end up convoluted and lacking in interpretability (it’s not uncommon to catch yourself saying things like, “this model shows a 1.8-unit increase in the logged Y for every 1-unit increase in the quadraticness of X” – NO ONE wants to hear this except stats geeks!). And compounding the problem, when multiple predictors are in the equation, the interpretation of any one coefficient gets even more complicated to explain.

Lucky for us, there is an algorithm that often outperforms the GLM, can deal with non-linearity, and also has the added benefit of being highly interpretable. I introduce to you, the decision tree.

What is a decision tree?

A decision tree is a purely data-driven (non-parametric) model that one can use for both regression and classification. It works by partitioning the data into nested subsets using a sequence of binary splits. The algorithm declares a root node attribute (at the top of the ‘tree’) and then progresses downward, making binary split decisions for how to subset the data until it is all optimally partitioned. At this point the splitting stops and the data resides in a terminal node (sometimes called a ‘leaf node’). The resulting structure is tree-like (really an inverted tree), with the root node at the top, decision nodes in the middle, and the leaf nodes at the bottom.

In the example above, the objective is to classify whether a customer will purchase a computer or not. The predictor space is three-dimensional (age, credit rating, and student status). In this case, all predictors are nominal, but decision trees also work with continuous predictors. After a decision tree is built, one simply feeds new data into the tree, and the algorithm trickles the data down until it lands in a leaf node. At that point, the prediction is the modal (or mean) response of that particular leaf node. In the tree above, if a customer walks in who is a senior and has an excellent credit rating, you would predict that this customer is likely to purchase a computer.

How does a decision tree decide where to split?

So now that we know what a decision tree is, it’s natural to wonder how the tree is built and how all of this splitting actually works. In the example above you may also be wondering what determined age to be the root node rather than credit rating or student status. Though specific details vary by implementation (there are a few), in essence all trees choose an attribute and split criterion that partition the data into maximally homogeneous segments. Basically, at each step, the algorithm considers every possible variable to split on and every possible ‘cutpoint’ value within the range of that variable, and it chooses the split that produces the most homogeneous subsets.

How is homogeneity measured? Well, it depends on the structure of your variables. In the example above, the predictors are all nominal and the response is also nominal (purchase vs. not). In such cases, a common splitting criterion is the value that minimizes entropy (or, equivalently, maximizes information gain; you can read about both here). Another algorithm built on this basic idea, CHAID, chooses the attribute split that maximizes the test statistic in a chi-squared test of independence; a high test statistic indicates the attribute and its splitting criterion are doing a good job of separating purchasers from non-purchasers. In the case of regression, the process is similar, but instead of class homogeneity, the algorithm squares the differences between predicted and actual responses and chooses the split that produces the greatest reduction in the sum of squared residuals (RSS).
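As an illustration, here is a small sketch of the entropy/information-gain criterion in R (the `buys` and `student` vectors are made-up toy data, not from any real dataset):

```r
# Shannon entropy of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain from splitting `labels` on a categorical attribute:
# parent entropy minus the size-weighted entropy of the child subsets
info_gain <- function(labels, attribute) {
  n <- length(labels)
  weighted <- sum(sapply(split(labels, attribute),
                         function(s) (length(s) / n) * entropy(s)))
  entropy(labels) - weighted
}

# Toy data: does splitting on student status separate buyers from non-buyers?
buys    <- c("yes", "yes", "no", "no", "yes", "no")
student <- c("yes", "yes", "no", "no", "yes", "yes")
info_gain(buys, student)   # ~0.46 bits of information gained
```

The tree-building algorithm would compute this gain for every candidate attribute and pick the one with the highest value.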

Whichever criterion the algorithm uses (chi-square, RSS, entropy/information gain, etc.), the process continues recursively until there are no more splits. So after the tree makes its first split, it partitions each of the two resulting subsets again by splitting on some other value within that subset, and so on. The stopping criterion is often set by the user and can include settings such as the minimum number of observations allowed in each terminal node, the maximum number of splits, etc.

How do decision trees compare to linear models?

When a set of X’s is truly related to Y in some non-linear way, your standard least squares regression line will be biased (meaning a straight line doesn’t approximate the true nature of the relationship in the data). In such cases you can go ahead and fit a linear regression anyway. It may even produce a respectable p-value and explain a significant portion of the overall variability in Y. But even when a linear model predicts decently, if the relationship is non-linear a decision tree will typically outperform it.
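To see this concretely, here is a quick sketch comparing the in-sample fit of a straight line versus a tree on deliberately non-linear toy data (the sine relationship, noise level, and sample size are arbitrary choices; rpart ships with R):

```r
library(rpart)

set.seed(3)
x <- runif(300, -3, 3)
y <- 2 * sin(x) + rnorm(300, sd = .3)   # truly non-linear relationship
dat <- data.frame(x, y)

lin  <- lm(y ~ x, data = dat)           # biased straight-line fit
tree <- rpart(y ~ x, data = dat)        # piecewise-constant tree fit

mean(residuals(lin)^2)                  # MSE of the straight-line fit
mean((dat$y - predict(tree))^2)         # MSE of the tree (smaller here)
```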

So, decision trees tend to perform better than linear models when there is a true non-linear relationship between X and Y. Another big advantage over the linear model is ease of interpretation. Though a simple linear regression is fairly interpretable (a coefficient measures the change in Y per unit change in X), things get much fuzzier when you start dealing with multiple predictors, interaction terms, polynomials, etc. In contrast, decision trees are extremely straightforward and can be readily understood by almost anybody (this is especially useful for communicating with those who lack formal training in analytics).

So picking up non-linearities and also ease of interpretation are the two main advantages of decision trees. Like with all algorithms, however, there is no free lunch. One disadvantage with decision trees is that they tend to model the signal so well that they actually often pick up a lot of the noise (this is called ‘overfitting’). Remember, there is always a random component in our data, and so really specific trees that can pick up peculiarities of the data structure may be modeling some of that random component and treating it like a systematic pattern. Luckily there are many methods to handle this pitfall of decision trees (ensemble methods, bagging, random forests, etc.). We won’t focus on these in this blogpost, but just know that there are ways to treat this scenario with decision trees.

How do I get started with decision trees?

Now that you have learned about decision trees, the best way to get better at using them is to go test them out! There are many great packages available in R and Python to get you started. I would personally recommend the rpart package in R. You can run it on the iris dataset in R and have a decision tree built in minutes!
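As a starting point, here is a minimal sketch using rpart on the built-in iris dataset (the control settings shown are just examples of the user-set stopping criteria mentioned earlier, and happen to be rpart's defaults):

```r
library(rpart)

# Grow a classification tree for Species; minsplit is the minimum number of
# observations needed to attempt a split, cp is a complexity penalty
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(minsplit = 20, cp = 0.01))

print(fit)                                            # text view of the splits
predict(fit, iris[1, , drop = FALSE], type = "class") # classify one flower
```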

In a future blog I will do a tutorial in R for building a decision tree and visualizing the results. It’s really quite simple. In the meantime, I welcome you to comment with any thoughts or questions!

The R Advantage – Tip #1

This is the first entry in a series (The R Advantage) that will spotlight useful R tips that I’ve picked up along the way. The audience is anyone from the absolute beginner who may be on the fence about switching to R, to the person who is just starting out in R and learning simple commands, to the more intermediate user who is looking to expand his or her R knowledge.

These tips are simple to execute, require no deep coding knowledge, and pay dividends in saving you time and generally making your life easier.

I’ll note that one thing I really would have appreciated when I was just starting R was to have a step-by-step, intuitive explanation of what code was doing at each line (so I will do this to make things clear for the beginners).

Without further ado, today’s R tip is: how to merge files that have unique observations (rows), but the same variable structure (columns).

Problem scenario: You have just run an experiment in a lab. Say you have collected reaction time data from participants, and your study produces 1 data file per participant. Come time for analysis, you have to somehow aggregate all of the data into a single file.

This problem frequently comes up in the social sciences (but probably shows up in many other fields too). And in fact, this is the first problem that led me to want to switch out of doing analysis with GUI’s like SPSS and to get further into unlocking the efficiency of R.

Before I knew how to use R, I would sometimes spend hours doing this aggregation process manually (opening each file, copying and pasting the data into a parent file, closing the child file, and then doing it again). It was exhausting, too mindless for me to enjoy, and was ultimately taking hours out of my day.

Solution: let R iterate through each file in your file directory, extract the contents, and build the central file for you.

Benefit: It literally saves you hours of time.

So here is the first chunk of code that personally convinced me of The R Advantage.

Note: this code assumes all of your data files are in the same folder (and are the only files in that folder). It also assumes the variable names are the same across files.

files_full <- list.files(getwd(), full.names=TRUE)
df <- data.frame() 

for (i in 1:length(files_full)) {
        currentFile <- read.csv(files_full[i])
        df <- rbind(df, currentFile)
}

write.csv(df, "merged_data.csv")

Line 1: Lists all files in current directory and stores them into a variable called ‘files_full’

Line 2: Initializes an empty data frame (we will use this to build our central data file)

Line 4: Starts a for loop (a structure that ticks through a predetermined number of times – in this case, the number of files in the directory – and performs an operation on each iteration). Lines 5 and 6 detail the operation.

Line 5: Reads in the data from the current participant’s file and stores it into a variable called ‘currentFile’

Line 6: Updates the central data frame (df) with the current participant’s file contents

Line 9: Writes and saves the central data frame to a merged data file in .csv format. 

That’s all for this entry in The R Advantage. I hope that you can see with just a few lines of code you can gain hours of time! I know you will want to try it out now. If you do, feel free to drop me a line in the comments and let me know how it works!

Poster accepted at INFORMS 2016

In November I’m headed to Nashville to present at the annual conference of the Institute for Operations Research and the Management Sciences (INFORMS). To preview the models I built and will be presenting: I used logistic and linear regression to make MLB predictions for 2016 (total wins and playoff status for both the Chicago Cubs and White Sox). My next step is to build a webscraper to gather data, and perhaps do some agent-based modeling to enable day-by-day predictions for individual matchups (in addition to predicting season-long trends).

I’m looking forward to presenting the initial data and of course meeting all the great people there!

A Bayesian Approach to the 2016 NCAA Tournament

My friend Chris and I decided to enter a Kaggle competition for the 2016 NCAA Tournament. March Madness is notoriously challenging to predict from a statistical point of view (see my alma mater MSU, a #2 seed, which earlier today lost in the first round to Middle Tennessee, a #15 seed). No one saw that coming. Such is the nature of March Madness.

But barring those rare upsets, we are actually pretty confident in our predictions, or at least the basis for them. We used a Bayesian technique to generate probabilities of each team winning, based on a latent ability variable (composed of many “black-boxed” statistics). Our goal is to outperform a baseline model of simply choosing the higher-ranked team each round. You can see our predictions and read more about our model in the shared Dropbox folder below.


The 6 Principles of Effective Data Visualizations: Principle #1 – Show Comparisons, Contrasts, Differences

At their core, statistical analyses always involve making a comparison of something. We might want to know how voter turnout differs between Democrats and Republicans. Or scientists might test the effects of a new treatment by comparing an experimental group to a control group. We might just want to see how global temperature has changed over time. There are infinite examples that we could think up, but the key here is that analytics involves making comparisons of some sort.

It makes sense, then, that Principle #1 in making effective data visualizations is to Show Comparisons, Contrasts, Differences (in an interesting and appropriate way). A good display will show how two or more things (groups of people, treatment levels, years, states, etc.) compare to one another. Not only that, but our visualizations should draw out this comparison in a way that is engaging to the reader. Republicans might be better represented by the color red, for example, and Democrats in blue. Or if only one group differs from a number of other groups in some meaningful way, it would be sensible to draw attention to that one group with a visual representation of anomaly (perhaps by coloring its line on a scatterplot, while keeping all the other lines black). So, in sum, a good visualization will not only represent a comparison with the data, but it must do so in an interesting and sensible way.

Take a look at this example of a poor visualization. I got it from the site WTF Visualizations (which can help you learn what not to do in data science, or at least give you a few laughs!). Anyway, it violates our first principle in a number of ways.


First, there does not seem to be any comparison being made in the display. The visualization is simply listing the functions and benefits of vitamin B3. No comparisons, no contrasts, nothing. This makes for really boring content to visualize right from the get go. Even the slickest graphs in the world would not make this “story” (if you can call it that) interesting.

Not only that, but much to the reader’s confusion, parts of this visualization make it seem like there is a comparison being made (where there is not). For example, two of the statements hint at having an empirical basis (“significant decrease in heart disease” and “reduction in the risk of Alzheimers” both use terminology that sounds like statistical findings from a research publication), whereas the other two statements do not (“Assists in the energy production of cells” and “promotes healthy skin”, compared to the first two, have a more broad and “conjecture”-like format). This makes it unclear to the reader why there is a difference in phrasing at all for these statements. And adding to the confusion, these two sets of statements also differ in that one begins with verbs (“assists” and “promotes”), whereas the other begins with nouns (“reduction” and “decrease”). In both these examples of inconsistency, the visualization misleads the reader into thinking there might be a comparison being made between sets of statements (when, in fact, there is not). This is not only annoying to the reader, but costs the reader energy and time. And always remember: No good visualization should make the reader do more work than is necessary. 

The second major flaw in this visualization (which, to me, is actually more bothersome than the lack of comparisons) is that the designers use a Venn-diagram style to present the four statements, indicating conceptual overlap. Yet there is no such overlap (nor would we expect there to be in an analysis that simply lists a few facts). If the designers were going to stick with the bland content, they should at least have thought up a more appropriate way to represent their facts. I actually think a simple list format (albeit extremely boring) would be more suitable here. There are many, many good ways to format these statements; four overlapping circles is not one of them. Leave that for scenarios in which four statements do overlap in some meaningful way.

The one and only area in which this visualization might be said to succeed is in the choice of using coffee (a food source that is high in vitamin B3) to represent the function of vitamin B3. I find this to be an appropriate use of imagery. And I also like that it separates the function of vitamin B3 from its benefits (by presenting the coffee mug as distinct from the three transparent circles). All that said, in my opinion, these points of strength don’t outweigh the major flaws of this visualization.

Remember our Principle #1 states that effective visualizations 1) make a comparison, and 2) represent that comparison in an appropriate and interesting way. The above visualization does not meet either of these criteria and is therefore, a poor visualization.

So far we’ve seen how not to demonstrate Principle #1. In the next entry we will turn briefly to a visualization that is a good demonstration of our principle. Stay tuned!


2016 MLB Predictions – Chicago Teams

Enjoy the analysis! I had fun learning how to do this, and then doing it!


Sabermetrics is the empirical analysis of baseball. Baseball has always been a game of statistics. But with the explosion of data in the past 7 years, in conjunction with the development of more advanced analytics techniques, baseball teams now – more than ever – can leverage their numbers in increasingly sophisticated ways. There is so much data available that even self-taught lay people – with the proper tools at their disposal – can be a part of the action.

Source: http://www.datanami.com/2014/10/24/todays-baseball-analytics-make-moneyball-look-like-childs-play/

In this report, I use linear regression to predict a number of outcomes for both the Chicago Cubs and the Chicago White Sox in 2016. Extrapolating from just a few linear models, I arrived at the following projections:

  • The Chicago White Sox will win 87 games in 2016 and will NOT make the playoffs.
  • The Chicago Cubs will win 108 games in 2016 and WILL make the playoffs.

Loading the Data

First, let’s load in a dataset of MLB statistics from 1962-2015.

baseball <- read.csv("baseball.csv")
str(baseball)
## 'data.frame':    1322 obs. of  15 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 2 3 4 5 7 8 9 10 11 12 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
##  $ Year        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ RS          : int  720 573 713 748 689 622 640 669 737 689 ...
##  $ RA          : int  713 760 693 753 608 701 754 640 844 803 ...
##  $ W           : int  79 67 81 78 97 76 64 81 68 74 ...
##  $ OBP         : num  0.324 0.314 0.307 0.325 0.321 0.306 0.312 0.325 0.315 0.328 ...
##  $ SLG         : num  0.414 0.359 0.421 0.415 0.398 0.38 0.394 0.401 0.432 0.42 ...
##  $ BA          : num  0.264 0.251 0.25 0.265 0.244 0.25 0.248 0.256 0.265 0.27 ...
##  $ Playoffs    : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ RankSeason  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ RankPlayoffs: int  NA NA NA NA 5 NA NA NA NA NA ...
##  $ G           : int  162 162 162 162 162 162 162 162 162 162 ...
##  $ OOBP        : num  0.322 0.339 0.321 0.327 0.29 0.322 0.328 0.297 0.353 0.33 ...
##  $ OSLG        : num  0.423 0.426 0.418 0.424 0.372 0.41 0.42 0.386 0.461 0.444 ...

The dataset contains many variables related to historical team performance. Each variable is described below:

  • Team: MLB team abbreviation (e.g. ARI = Arizona Diamondbacks)
  • League: American League (AL) or National League (NL)
  • Year: Year
  • RS: Runs scored by the team’s batters
  • RA: Runs allowed by the team’s pitchers
  • W: Total wins in the regular season
  • OBP: On-base percentage – a measure of how often a batter reaches base
  • SLG: Slugging percentage – a measure of the power of a hitter (total bases / total at-bats)
  • BA: Batting average – a measure of the average performance of a batter (hits / total at-bats)
  • Playoffs: Dichotomous variable representing whether or not the team made the playoffs
  • RankSeason: How each team ranked overall in the regular season
  • RankPlayoffs: How each playoff team ranked in the playoffs
  • G: Total games played
  • OOBP: Opponents’ OBP – a measure of how often a team lets opponents reach base
  • OSLG: Opponents’ SLG – a measure of how far along the bases a team allows opponents to get

Framing the Analysis

The primary goal of this analysis is to determine if either of the Chicago baseball teams will make it to the playoffs in 2016. To do this, it would be helpful to know how many regular season games a team must typically win to get to the playoffs.

Making it to the playoffs

The graph shows that, historically, if a team wins 94 or more games, it has a strong chance of making the playoffs.

Image of playoff wins

Click here for more information about this chart.

Winning 94 games

So how does a team win games?

A baseball win happens when you outscore your opponent. We will create a variable, “RD” (runs differential) to represent the extent to which a team outscores their opponent. RD is simply calculated by subtracting a team’s runs allowed from their runs scored.

baseball$RD <- baseball$RS-baseball$RA

Indeed, historically there is a strong relationship between RD and wins.

The graph shows a clear linear pattern. But let’s actually put some numbers on this relationship.

Building a Model

We will build a linear model to predict wins with runs differential.

winsReg <- lm(W ~ RD, data = baseball)
## Call:
## lm(formula = W ~ RD, data = baseball)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4023  -2.7977   0.0735   2.8746  12.7747 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 80.910741   0.110676  731.06 <0.0000000000000002 ***
## RD           0.104284   0.001082   96.39 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 4.024 on 1320 degrees of freedom
## Multiple R-squared:  0.8756, Adjusted R-squared:  0.8755 
## F-statistic:  9290 on 1 and 1320 DF,  p-value: < 0.00000000000000022

The model summary indicates there is a strong linear relationship between runs differential and wins. The p-value is highly significant, and the model produced an R^2 value that is typically considered high.

Using the Model

So now we’ve seen with both a chart and a model summary that RD is a good predictor of regular season wins. One useful application of the model summary is that now we can compute just how big the run differential has to be for a team to have a strong chance of winning 94 games and then making it to the playoffs.

Remember our summary showed us that:

  • Wins = 80.910741 + 0.104284(RD)

Thus, for a team to have a strong chance of making it to the playoffs, the value of (80.910741 + 0.104284*RD) must be greater than or equal to 94.

Do some simple arithmetic:

thresholdRD <- (94-80.910741)/0.104284
## [1] 125.5155

To win 94 games and have a strong chance of making the playoffs, a team must outscore its opponents by at least 126 runs (125.52, rounded up) over the regular season.

Doing the Analysis

Now that we have a couple of key figures in hand, let’s proceed with the actual analysis.

Remember, we want to determine if either of the Chicago baseball teams will make it to the playoffs this year. We know that if a team’s RD is at least 126, they will probably win 94 games and get to go to the playoffs.

So how can we predict the RD for each team in 2016? We will predict how many runs each team will score in 2016 (this can be done with batting statistics) and then how many runs each team will allow in 2016 (this can be done with pitching statistics).

Building a Model for Runs Scored

Runs scored (RS) can be predicted by OBP (on base %, includes walks) and SLG (how far a player gets around the bases at his at-bat; this measures a batter’s power).

Many teams used to focus on batting average. However OBP is well-known to be the more important predictor. Baseball analysts also consider SLG to be important.

Let’s confirm this with a regression model.

runsReg <- lm(RS ~ OBP + SLG + BA, data = baseball)
## Call:
## lm(formula = RS ~ OBP + SLG + BA, data = baseball)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.400 -16.779  -1.008  16.935  93.445 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  -811.41      16.66 -48.707 <0.0000000000000002 ***
## OBP          2941.02      94.69  31.061 <0.0000000000000002 ***
## SLG          1525.85      36.58  41.708 <0.0000000000000002 ***
## BA           -154.84     111.65  -1.387               0.166    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 25.42 on 1318 degrees of freedom
## Multiple R-squared:  0.9212, Adjusted R-squared:  0.921 
## F-statistic:  5137 on 3 and 1318 DF,  p-value: < 0.00000000000000022

Note: Both OBP and SLG are significant predictors of runs scored; BA is not. Moreover, the coefficient for BA is negative, implying that a lower batting average predicts more runs scored, which obviously makes no sense. In statistics, when this happens, it usually means there is multicollinearity in the data: the model summary reports the unique variance accounted for by each predictor, and because OBP, SLG, and BA are all likely correlated with one another, once OBP and SLG have accounted for their portions of the variability, there is very little left over for BA to explain. For this reason, we can simplify our model by removing BA.
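We can check this suspicion directly with a correlation matrix of the three batting statistics (this assumes the `baseball` data frame loaded earlier):

```r
# Strong pairwise correlations here would support the multicollinearity story
cor(baseball[, c("OBP", "SLG", "BA")])
```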

Now we’ll re-run the model with only OBP and SLG as predictors.

runsReg2 <- lm(RS ~ OBP + SLG, data = baseball)
## Call:
## lm(formula = RS ~ OBP + SLG, data = baseball)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.871 -16.777  -1.116  16.998  93.008 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  -817.51      16.07  -50.86 <0.0000000000000002 ***
## OBP          2859.24      74.11   38.58 <0.0000000000000002 ***
## SLG          1507.33      34.07   44.24 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 25.43 on 1319 degrees of freedom
## Multiple R-squared:  0.9211, Adjusted R-squared:  0.921 
## F-statistic:  7700 on 2 and 1319 DF,  p-value: < 0.00000000000000022

The new model is much better. With only two predictors, it is more parsimonious, and it has almost the exact same R^2 value as the previous model.

Looking at the coefficients of this model, the coefficient for OBP is almost twice as large as the coefficient for SLG. Because the two predictors are on the same scale, this tells us that OBP is the better predictor of runs scored (with about twice the predictive power of SLG).

In sum, BA is overvalued, SLG is important, but OBP is more important.

Building a Model for Runs Allowed

Runs allowed (RA) can be predicted by OOBP (opponent’s on base % allowed by the pitchers) and OSLG (opponent’s SLG allowed by the pitchers).

Let’s build a model to predict RA with OOBP and OSLG.

ramodel <- lm(RA ~ OOBP + OSLG, data = baseball)
## Call:
## lm(formula = RA ~ OOBP + OSLG, data = baseball)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.517 -17.954   0.902  18.424  71.930 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  -851.48      24.63  -34.57 <0.0000000000000002 ***
## OOBP         2777.06     138.80   20.01 <0.0000000000000002 ***
## OSLG         1639.65      81.13   20.21 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 26.68 on 507 degrees of freedom
##   (812 observations deleted due to missingness)
## Multiple R-squared:  0.9142, Adjusted R-squared:  0.9139 
## F-statistic:  2702 on 2 and 507 DF,  p-value: < 0.00000000000000022

This model is strong; both OOBP and OSLG are significantly related to RA.

2016 Predictions

So far we have established a number of key facts in this analysis:

  • If a team wins 94 games they will probably make it to the playoffs
  • If a team outscores their opponents by 126 runs (RD >= 126), they have a strong chance of meeting the 94-game threshold.
  • A team’s RD is predicted by RS and RA.
    • RS is predicted by OBP and SLG.
    • RA is predicted by OOBP and OSLG.
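
The first two facts can be checked against each other using the wins regression fit earlier in the analysis (wins = 80.910741 + 0.104284 * RD):

```r
# Wins as a function of run differential, using the coefficients from the
# wins regression fit earlier in the analysis
predict_wins <- function(rd) 80.910741 + 0.104284 * rd

predict_wins(126)             # just over the 94-win threshold (~94.05)
(94 - 80.910741) / 0.104284   # RD needed for 94 wins: ~125.5, i.e. RD >= 126
```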

Given these facts, we will first project OBP, SLG, OOBP, and OSLG for each Chicago team in 2016, and then feed those projections into our regression models to predict how many games each Chicago team will win in 2016.

Methodology: 2016 Batting Figures

These numbers were available on the internet.

First, I obtained the projected 2016 roster for both the White Sox and the Cubs.

Then, for each player, I computed composite figures for OBP and SLG using the results of 4 different sabermetrics projection systems.

By using multiple systems, I wanted to lessen the impact that any one system’s error had on the overall prediction.

Each system is slightly different, but all of them involve some kind of weighted average of the previous 3 years' statistics.

Methodology: 2016 Pitching Figures

I obtained the projected 2016 lineup and bullpen for both the White Sox and the Cubs.

Unfortunately, unlike the batting statistics, the 4 available sabermetrics systems did NOT include projections for OOBP and OSLG.

To make these projections, I used a weighted average of the previous 3 years' OOBP and OSLG data for each pitcher in both the lineup and the bullpen, and then weighted those figures based on how much playing time each pitcher gets.

Weighted Average Based on Past 3 Years

The weights I used were generated in a previous sabermetrics analysis, which estimated, for a given player, how predictive each of his previous seasons' figures was of his current-season figures. The authors came up with weights of .47, .32, and .18 for the previous 3 years, from most recent to least recent.

Note: 3 pitchers in my analysis only had 2 seasons' worth of data. In these cases, I used weights of 0.6 and 0.4 (roughly preserving the .47-to-.32 ratio). For players with only 1 year's worth of data, I weighted that year at 1. I excluded all rookie players from the analysis.
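
As a sketch, the 3-year weighting described above can be written as a small helper function. The stat values passed in below are made-up illustrations, not actual player data, and `weighted.mean()` rescales the .47/.32/.18 weights (which sum to .97) to sum to 1:

```r
# Project a stat from up to 3 previous seasons, most recent season first.
# Weights follow the analysis described above: .47/.32/.18 for 3 seasons,
# .6/.4 for 2 seasons, and 1 for a single season.
project_stat <- function(seasons) {
  weights <- switch(as.character(length(seasons)),
                    "3" = c(0.47, 0.32, 0.18),
                    "2" = c(0.60, 0.40),
                    "1" = 1)
  # weighted.mean() normalizes the weights to sum to 1
  weighted.mean(seasons, w = weights)
}

# Hypothetical pitcher with OOBP figures for 2015, 2014, and 2013
project_stat(c(0.310, 0.325, 0.340))
```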

Weighted Average Based on Playing Time

After generating each pitcher’s weighted OOBP and OSLG, I then weighted each of those figures based on how much playing time a pitcher gets (to do this, I computed the proportion of batters faced that each pitcher contributed to the team’s total batters faced in the 2015 season). My rationale was that the pitchers who face more batters, such as starting pitchers, should be weighted more heavily.
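
This second weighting step can be sketched with base R's weighted.mean(); all of the figures below are hypothetical illustrations, not actual 2015 data:

```r
# Roll per-pitcher projections up to team-level figures, weighting each
# pitcher by his share of the team's total batters faced (BF) in 2015.
# All numbers here are hypothetical illustrations.
staff <- data.frame(
  oobp = c(0.300, 0.330, 0.315),
  oslg = c(0.380, 0.410, 0.395),
  bf   = c(850, 820, 300)        # batters faced in 2015
)

team_oobp <- weighted.mean(staff$oobp, w = staff$bf)
team_oslg <- weighted.mean(staff$oslg, w = staff$bf)
```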

2016 Predicted Figures

Using these methodologies, I obtained the following figures:

Team        Statistic   Projected Figure
White Sox   OBP         0.321666667
White Sox   SLG         0.41475
Cubs        OBP         0.34253125
Cubs        SLG         0.43878125
White Sox   OOBP        0.320957143
White Sox   OSLG        0.38205
Cubs        OOBP        0.29843875
Cubs        OSLG        0.353745

It appears that in 2016 the Chicago Cubs will perform better in both batting and pitching than the Chicago White Sox.

Predicting Outcomes for the Chicago White Sox

But what does this mean for the White Sox in terms of making the playoffs?

Remember: if we can predict the runs scored and runs allowed for a team, we can also predict how many games they will win.

Using the runs scored linear regression model (the one that uses OBP and SLG as independent variables), we can find the number of runs we expect the White Sox to score:

## Call:
## lm(formula = RS ~ OBP + SLG, data = baseball)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.871 -16.777  -1.116  16.998  93.008 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  -817.51      16.07  -50.86 <0.0000000000000002 ***
## OBP          2859.24      74.11   38.58 <0.0000000000000002 ***
## SLG          1507.33      34.07   44.24 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 25.43 on 1319 degrees of freedom
## Multiple R-squared:  0.9211, Adjusted R-squared:  0.921 
## F-statistic:  7700 on 2 and 1319 DF,  p-value: < 0.00000000000000022
obp_chw_2016 <- 0.321666667
slg_chw_2016 <- 0.41475
runspredicted_chw_2016 <- -817.51 + 2859.24*obp_chw_2016 + 1507.33*slg_chw_2016
runspredicted_chw_2016
## [1] 727.3773

The White Sox are predicted to score ~727 runs in 2016.
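
Since the same arithmetic is repeated for the Cubs later on, the hand calculation above can also be wrapped in a small helper function (a sketch using the coefficients from the RS model summary, rounded as reported):

```r
# Predicted runs scored, from the RS ~ OBP + SLG coefficients reported above
predict_rs <- function(obp, slg) {
  -817.51 + 2859.24 * obp + 1507.33 * slg
}

predict_rs(0.321666667, 0.41475)  # White Sox 2016: ~727.4 runs
```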

Now we can also find the predicted runs allowed, using the runs allowed regression. Remember, this uses opponents' OBP (OOBP) and opponents' SLG (OSLG).

## Call:
## lm(formula = RA ~ OOBP + OSLG, data = baseball)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.517 -17.954   0.902  18.424  71.930 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  -851.48      24.63  -34.57 <0.0000000000000002 ***
## OOBP         2777.06     138.80   20.01 <0.0000000000000002 ***
## OSLG         1639.65      81.13   20.21 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 26.68 on 507 degrees of freedom
##   (812 observations deleted due to missingness)
## Multiple R-squared:  0.9142, Adjusted R-squared:  0.9139 
## F-statistic:  2702 on 2 and 507 DF,  p-value: < 0.00000000000000022
oobp_chw_2016 <- 0.320957143
oslg_chw_2016 <- 0.38205
runsAllowedPredicted_chw_2016 <- -851.48 + 2777.06*oobp_chw_2016 + 1639.65*oslg_chw_2016
runsAllowedPredicted_chw_2016
## [1] 666.2655

The White Sox are predicted to allow their opponents to score ~666 runs in 2016.
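
The runs allowed calculation can be wrapped the same way (again a sketch, using the coefficients from the RA model summary, rounded as reported):

```r
# Predicted runs allowed, from the RA ~ OOBP + OSLG coefficients reported above
predict_ra <- function(oobp, oslg) {
  -851.48 + 2777.06 * oobp + 1639.65 * oslg
}

predict_ra(0.320957143, 0.38205)  # White Sox 2016: ~666.3 runs
```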

Now we can calculate the predicted difference between runs scored and runs allowed.

predictedRD_chw <- runspredicted_chw_2016 - runsAllowedPredicted_chw_2016
predictedRD_chw
## [1] 61.11179

How many games will the White Sox win? We can use the run differential regression to predict their 2016 wins.

wins_chw <- 80.910741 + 0.104284*predictedRD_chw
wins_chw
## [1] 87.28372

Thus, the Chicago White Sox are predicted to win 87 games (<94), meaning they will probably NOT make the playoffs in 2016.

Predicting Outcomes for the Chicago Cubs

Using the same runs scored linear regression model, we can find the number of runs we expect the Cubs to score:

## Call:
## lm(formula = RS ~ OBP + SLG, data = baseball)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.871 -16.777  -1.116  16.998  93.008 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  -817.51      16.07  -50.86 <0.0000000000000002 ***
## OBP          2859.24      74.11   38.58 <0.0000000000000002 ***
## SLG          1507.33      34.07   44.24 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 25.43 on 1319 degrees of freedom
## Multiple R-squared:  0.9211, Adjusted R-squared:  0.921 
## F-statistic:  7700 on 2 and 1319 DF,  p-value: < 0.00000000000000022
obp_chc_2016 <- 0.34253125
slg_chc_2016 <- 0.43878125
runspredicted_chc_2016 <- -817.51 + 2859.24*obp_chc_2016 + 1507.33*slg_chc_2016
runspredicted_chc_2016
## [1] 823.2572

The Cubs are predicted to score ~823 runs in 2016.

Now we will use the runs allowed regression again to predict how many runs the Cubs will allow.

## Call:
## lm(formula = RA ~ OOBP + OSLG, data = baseball)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.517 -17.954   0.902  18.424  71.930 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  -851.48      24.63  -34.57 <0.0000000000000002 ***
## OOBP         2777.06     138.80   20.01 <0.0000000000000002 ***
## OSLG         1639.65      81.13   20.21 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 26.68 on 507 degrees of freedom
##   (812 observations deleted due to missingness)
## Multiple R-squared:  0.9142, Adjusted R-squared:  0.9139 
## F-statistic:  2702 on 2 and 507 DF,  p-value: < 0.00000000000000022
oobp_chc_2016 <- 0.29843875
oslg_chc_2016 <- 0.353745
runsAllowedPredicted_chc_2016 <- -851.48 + 2777.06*oobp_chc_2016 + 1639.65*oslg_chc_2016
runsAllowedPredicted_chc_2016
## [1] 557.3203

The Cubs are predicted to limit their opponents to scoring only 557 runs.

Now we will calculate the predicted difference between runs scored and runs allowed for the Cubs.

predictedRD_chc <- runspredicted_chc_2016 - runsAllowedPredicted_chc_2016
predictedRD_chc
## [1] 265.9369

And with an RD of ~266, how many games will the Cubs win?

wins_chc <- 80.910741 + 0.104284*predictedRD_chc
wins_chc
## [1] 108.6437

The Chicago Cubs are predicted to win 108 games (>94), meaning they WILL probably make the playoffs in 2016.

Both this run differential and this win total are remarkably high. To put them in perspective, remember the earlier graph of 50+ years of historical run differential data: a run differential of ~266 would position the 2016 Cubs as one of the highest-performing baseball teams of all time. It could be a great year for Cubs fans!


In Sum

  • The Chicago White Sox are projected to win 87 games in 2016 and NOT make the playoffs.
  • The Chicago Cubs are projected to win 108 games in 2016 and make the playoffs. Moreover, there is a good chance the 2016 Cubs will have an outstanding season!