Linear Regression Simplified: A Visual Tutorial with a Lab in R

Linear regression is perhaps the most widely used statistical tool in data science. It is a fundamental building block for statistical inference and prediction, and many more advanced methods derive from simple linear regression. This blogpost will walk you through the basic concept of linear regression. It will also provide a code tutorial to perform a linear regression in R.

By the end of this blogpost you will know:

  1. What linear regression is
  2. How linear regression works
  3. How to perform linear regression in R and how to interpret its output

What is linear regression?

Simple linear regression is a statistical method that describes the relationship between two continuous quantitative variables. What makes it linear is that it uses a line to describe the relationship.

As you can see below, we are describing the relationship between car weight and gas mileage. And it appears that a downward sloping line describes this relationship well. Generally, as car weight increases, gas mileage decreases.

 

But notice that whereas the line does a good job of capturing the overall downward “trend” of this relationship, it is also clearly an imperfect model. As you can see, many points are above the line and many are below it. In fact, the line does not even come close to passing through most of the data points. This means there is also some “scatter” or “error” about the line.

The goal of linear regression is to fit a line to the data that minimizes the scatter.

 

How does linear regression work?

Linear regression works by finding a line of best fit to pass through the data. Infinitely many lines can be fit to the data, but the best fitting line (called the “line of best fit”) will be the one that minimizes scatter. The metric that is used to quantify scatter is called the sum of squared residuals (sometimes called sum of squared error or SSE for short). 

Each data point has a corresponding “residual,” which is just a quantity representing the vertical distance between it and the line of best fit. In the plot below all the residuals are demarcated by black lines. You can see each residual represents how far ‘off’ the data point is from its estimated point on the line. In other words, the residual represents how much ‘error’ that data point contributes to the model.

[Figure: scatterplot with the fitted line and each residual drawn as a vertical black segment]

A tacit goal of linear regression (or really any statistical model) is to minimize the overall model error. This should make intuitive sense: the less error there is in a model, the better it is fitting the data. Because positive residuals (those above the line) and negative residuals (those below the line) would simply cancel each other out if we added them up directly, we want to treat both as equally ‘bad’. So we square each residual value and sum them up. The line of best fit is the line that minimizes the sum of these squared residuals.

You can see the line of best fit above for estimating gas mileage (mpg) is calculated to be: 37.3 − 5.34 * car weight (in 1000’s of pounds).

This equation tells us that any car’s gas mileage (mpg) is predicted to be 37.3 − 5.34 * its weight in thousands of pounds. So if a car weighs 3000 pounds, we would estimate its mpg to be 37.3 − 5.34*3 = 21.28.

You may be wondering how the linear regression model settled on that equation. Under the hood, the solution involves some matrix algebra (including a matrix inversion). We won’t go into those details here; luckily, in practice, modern software takes care of this for us.
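If you’re curious, for simple regression with a single predictor there is also a closed-form shortcut you can verify yourself: the slope is the covariance of x and y divided by the variance of x, and the intercept is chosen so the line passes through the means. Here is a minimal sketch using the mtcars data we work with below (the cov()/var() approach is just for illustration; lm() does the equivalent work for you):

# slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
slope     <- cov(mtcars$wt, mtcars$mpg) / var(mtcars$wt)
intercept <- mean(mtcars$mpg) - slope * mean(mtcars$wt)
c(intercept = intercept, slope = slope)   # roughly 37.3 and -5.34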

Now let’s learn how to build our own linear regression model in R. We’ll walk through it step-by-step.

Applying Linear Regression: A Lab

For this lab we will use the stock data set mtcars in R. Load up the data and let’s check out the variables.

head(mtcars)

[Output: first six rows of mtcars]

As you can see this data set contains information about cars, including their miles per gallon (mpg), weight (wt), number of cylinders (cyl), etc.

We will build a simple linear regression model that predicts mpg with wt.

mod <- lm(mpg ~ wt, data = mtcars)

Notice we have to specify three parameters:

  1. The outcome variable to be predicted (mpg; sometimes called the dependent variable)
  2. The predictor variable (wt; sometimes called the independent variable)
  3. The data set (in this case mtcars).

Now let’s check the results!

summary(mod)

[Output: summary(mod) regression table]

You can see the summary statistics from our model above.

The Estimate column gives us the values for our line of best fit equation. Recall from earlier that for predicting mpg from wt this was given by 37.3 − 5.34*wt. The intercept is there to anchor the line (here it would be the predicted mpg of a car weighing zero pounds), so in most regression contexts it doesn’t mean much theoretically. The predictor coefficient, however, has a direct interpretation: for every 1-unit increase in car weight (in this context 1 unit = 1000 pounds), a car’s predicted gas mileage decreases by 5.34 mpg. It decreases because the coefficient is negative; if it were positive, the interpretation would change to “increases.”
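As a quick sanity check (assuming the mod object we fit above), we can ask the model to predict mpg for a 3,000-pound car. It should land very close to the hand calculation from earlier, with any small difference due to rounding the coefficients:

predict(mod, newdata = data.frame(wt = 3))   # about 21.25 mpg for a 3,000-pound car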

The Standard Error column tells us how precisely we estimated the respective terms. Generally, the wider the scatter, the higher the standard error. This makes sense visually. Compare the two models below:

[Figure: two side-by-side fits, the left with points hugging the line and the right with much wider scatter]

The model in the left pane is the model we just fit. The model in the right pane is a model I just made up (which is just wt predicting 3*mpg). You can see there are larger residuals in this second model. This model therefore has less precision in its estimate of the line of best fit. Generally, you want the data points tightly hugging the line (such as what is illustrated by the model in the left pane). When data points are further away from the line it indicates less precision.

So standard error is a quantity representing the precision in our estimates. Here is the same table of summary statistics as above, so you don’t have to scroll.

[Output: summary(mod) regression table, repeated]

The t-value column gives us a test statistic for our obtained equation values. It represents how many standard errors the estimated coefficient is away from 0 under the null distribution. Though we have to sacrifice some detail here to keep this explanation intuitive, you can crudely think of the t value as a measurement of the size of the effect (specifically, the size of the effect in units of standard errors). However, a more intuitive metric to understand (and one directly related to the t-statistic) is the p-value.

The p-value basically tells us how ‘surprising’ the obtained results would be if there were no real effect, or how strong the evidence is that we have an effect. We generally want small p-values. By convention, p-values below .05 indicate that an effect is statistically significant, and smaller p-values (such as .01) indicate even stronger evidence, and so on. In our output above, the p-value column indicates that wt is significantly and strongly related to mpg.

Notice too, both Residual standard error and Multiple R-squared are metrics for judging how well our line fits the data. Generally lower residual standard error and higher R-squared indicate a better fit to the data.
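If you want to pull these fit metrics out programmatically rather than reading them off the printed table, they are stored in the summary object (a small sketch, assuming the same mod object as above):

fit <- summary(mod)
fit$r.squared   # Multiple R-squared
fit$sigma       # Residual standard error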

That sums up the output in R. However, another cool trick in RStudio is to type the $ symbol right after your model object’s name. RStudio will pop up a list of the components stored in the model that you can explore further.

[Screenshot: RStudio autocomplete listing the components stored in mod]

If you want to check out the coefficients, residuals,  fitted values, etc., you can do that.

This can really come in handy if you want to feed any of these values elsewhere in your analysis. Say for example you wanted the exact sum of squared residuals value. You can obtain it by typing:

sum(mod$residuals^2)

Conclusion

Congratulations! You are now officially more educated in linear regression! You know what linear regression is, what it is used for, how it works, and also how to do it in R. I hope that you found this useful. I would be interested to hear about your experiences with linear regression, so please do leave some thoughts below!


Multicollinearity and the precision of regression coefficient estimates

At some point in our lives we have all heard that multicollinearity is not good. But what exactly is it, why is it a problem, how can we detect it, and what are some solutions to it? This blog post will answer all of these questions and will also show empirically the effects of multicollinearity with a demo in R.

What is multicollinearity?

Multicollinearity is the state of there being linear dependence among two or more predictors in a regression model (or any form of the generalized linear model, such as logistic regression, ANOVA, etc.). In plain English, if your predictors are strongly correlated with one another, you’ve got multicollinearity. Typically a correlation of .8 or .9 among predictors indicates multicollinearity.
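A simple first check, then, is to look at the correlation matrix of your candidate predictors. As a quick sketch using the mtcars data from the linear regression post (weight and displacement are strongly correlated, around .89):

round(cor(mtcars[, c("wt", "disp", "hp")]), 2)   # wt and disp show a high pairwise correlation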

Why is it a problem?

In order to understand why multicollinearity is a problem, think about the goals of multiple regression. You want to know the effect (on average) that a 1-unit change in one variable (your predictor) has on another variable (your outcome). If you were predicting housing prices (Y) based on the square footage of the house (X1) and the nearby school quality (X2), a multiple regression model would tell you not just whether square footage and school quality are significantly related to price, but by how much. This “how much” question is answered by the beta (slope) coefficients in the model output, which tell us, “for every 1-unit change in X1, holding constant levels of X2, you can expect Y to change by the value of the X1 coefficient”.
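To make that interpretation concrete, here is a minimal sketch with made-up housing data (the variable names and numbers are invented for illustration, not taken from a real data set):

set.seed(1)
housing <- data.frame(sqft = rnorm(200, mean = 2000, sd = 400),
                      school_quality = rnorm(200, mean = 7, sd = 1.5))
housing$price <- 50000 + 120 * housing$sqft + 15000 * housing$school_quality + rnorm(200, sd = 30000)
coef(lm(price ~ sqft + school_quality, data = housing))
# the sqft slope: expected change in price per extra square foot, holding school quality constant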

Multicollinearity is a problem because linear dependence among predictors can cause instability in the estimates of our slope coefficients. This means the resulting t-statistics and p-values for those predictors are not reliable. In our previous example with real estate, if you were considering adding a room to your house to increase the square footage, the best model would tell you precisely (with confidence) by how much you can expect the price of your house to increase. But to the extent there is multicollinearity in the predictors, our confidence in that estimate is undermined.

So where does this lack of estimate precision come from? The nuts and bolts of it is that it all has to do with instability in the design matrix when computing a least squares regression line. Below is a set of matrices and some made up data:

  •  A 5×5 matrix X representing 5 observations of 5 parameters (the intercept X0 and four predictors X1, X2, X3, and X4)
  •  The transpose of X
  •  A 5×1 matrix Y representing the 5 observations’ responses on the outcome variable

[Figure: the design matrix X, its transpose, and the response vector Y]

The equation for calculating the slope coefficients in a linear regression model is shown below:

β̂ = (XᵀX)⁻¹XᵀY

Basically you can see this equation inverts the product of X-transposed and X and then multiplies that inverse by the product of X-transposed and Y. This is how the computer formulates a line of best fit in linear regression models. And the reason multicollinearity will muck up this formulation (for reasons we won’t detail too much) is that the inverse of that square matrix can only exist if all the columns of X are linearly independent. So if there is linear dependence, it becomes impossible to carry out the inversion part of the equation.

So, that is the formal reason why perfect multicollinearity breaks the least squares calculation. But what about when predictors are not perfectly linearly dependent, but something close (like correlated at .9)? Here, the matrix inversion can occur, and the equation above will generate the slope coefficients, but those estimates will end up being unstable and lacking precision (if you are interested in knowing why, check out this page).
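To see that the inversion really is the heart of it, here is a small sketch that computes the coefficients directly from the normal equation on simulated data and checks them against lm(). With highly correlated columns, t(X) %*% X becomes nearly singular and solve() becomes unstable (or errors out entirely when the dependence is exact):

set.seed(2)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 1 + 2 * x1 - x2 + rnorm(100)
X  <- cbind(1, x1, x2)                        # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y  # (X'X)^-1 X'Y, the normal equation
drop(beta_hat)                                # coefficients from the normal equation
coef(lm(y ~ x1 + x2))                         # lm() gives the same numbers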

Simulating the effects of multicollinearity in R

Let’s now show the effects of multicollinearity empirically in R. First we will set a repetition counter to 10,000 cycles and initialize two empty data frames, one for a linearly independent model (without multicollinearity) and one for a linearly dependent model (with multicollinearity). We will investigate how multicollinearity affects the p-values and standard errors of our regression slope coefficients. If theory is correct, we should see the linearly-dependent model yielding larger standard errors and more fluctuation in p-values.

set.seed(051617)
reps = 10000

uncorrelated_mod_dump <- data.frame(x1_pval = numeric(reps), x1_se = numeric(reps),
                              x2_pval = numeric(reps), x2_se = numeric(reps),
                              x3_pval = numeric(reps), x3_se = numeric(reps),
                              cor_x1x2 = numeric(reps), cor_x1y = numeric(reps))

correlated_mod_dump <- data.frame(x1_pval = numeric(reps), x1_se = numeric(reps),
                              x2_pval = numeric(reps), x2_se = numeric(reps),
                              x3_pval = numeric(reps), x3_se = numeric(reps),
                              cor_x1x2 = numeric(reps), cor_x1y = numeric(reps))

For each of the 10,000 iterations we will simulate two models and store the results in their respective data frames:

Model 1 (uncorrelated-predictors model) will be a model with 3 linearly independent predictors and X1 moderately correlated to Y (r = ~.45). Model 2 (correlated-predictors model) will be the same model, but we’ll adjust X2 to be correlated with X1 at r = ~.95 (this will inject some multicollinearity in it).

for(i in 1:reps){
        #linearly independent predictors with x1 predicting y
        x1 <- rnorm(100, 2, 3)
        x2 <- rnorm(100, 2, 3)
        x3 <- rnorm(100, 2, 3)
        y <- x1 + rnorm(100, 5, 6)

        lm_uncorrelated <- summary(lm(y ~ x1 + x2 + x3))

        uncorrelated_mod_dump$cor_x1y[i] = cor(x1,y)
        uncorrelated_mod_dump$cor_x1x2[i] = cor(x1,x2)
        uncorrelated_mod_dump$x1_pval[i] = lm_uncorrelated$coefficients[2,4]
        uncorrelated_mod_dump$x1_se[i] = lm_uncorrelated$coefficients[2,2]
        uncorrelated_mod_dump$x2_pval[i] = lm_uncorrelated$coefficients[3,4]
        uncorrelated_mod_dump$x2_se[i] = lm_uncorrelated$coefficients[3,2]
        uncorrelated_mod_dump$x3_pval[i] = lm_uncorrelated$coefficients[4,4]
        uncorrelated_mod_dump$x3_se[i] = lm_uncorrelated$coefficients[4,2]

        #correlate x2 with x1 and rerun model
        x2b <- x1 + rnorm(100, 0, 1)
        lm_correlated <- summary(lm(y ~ x1 + x2b + x3))

        correlated_mod_dump$cor_x1y[i] = cor(x1,y)
        correlated_mod_dump$cor_x1x2[i] = cor(x1,x2b)
        correlated_mod_dump$x1_pval[i] = lm_correlated$coefficients[2,4]
        correlated_mod_dump$x1_se[i] = lm_correlated$coefficients[2,2]
        correlated_mod_dump$x2_pval[i] = lm_correlated$coefficients[3,4]
        correlated_mod_dump$x2_se[i] = lm_correlated$coefficients[3,2]
        correlated_mod_dump$x3_pval[i] = lm_correlated$coefficients[4,4]
        correlated_mod_dump$x3_se[i] = lm_correlated$coefficients[4,2]

}

Now let’s check our results. First let’s see what proportion of simulations detected a significant relationship between X1 and Y. Remember, this correlation was exactly the same in both models at about r = .45.

sum(uncorrelated_mod_dump$x1_pval <= .05)/reps
sum(correlated_mod_dump$x1_pval <= .05)/reps

[Output: proportion of simulations with a significant X1 p-value for each model]

With the uncorrelated model 99.65% of the simulations detected the significant relationship between X1 and Y. In the correlated (with multicollinearity) model this percentage dropped to an abysmal 33.79%.

One may say, though, that perhaps this low rate is just explained by X2 being allocated all of the variance that it shared with X1. If this were the case, most of the simulations of the correlated model should pick up a significant X2-Y relationship. This would suggest the correlated-predictors model wasn’t necessarily performing worse overall, but perhaps it was just vacillating between X1 and X2 when selecting the predominant predictor of Y.

To investigate this, for each model we’ll aggregate the results across the 10,000 trials and check the X1, X2, and X3 slope coefficients’ mean p-values and standard errors. If multicollinearity was just causing fluctuation in selecting the most significant predictor, we should see a combined total of ~99.65% of simulations yielding significant p-values. But if the model actually gets worse overall with multicollinearity, we should see a reduced proportion of statistically significant relationships (X1-Y and X2-Y combined). In addition, just to double check that our simulation worked as intended, we’ll check the correlations between X1 and X2 and between X1 and Y.

round(apply(uncorrelated_mod_dump, 2, mean), digits = 4)
round(apply(correlated_mod_dump, 2, mean), digits = 4)

Table 1. Aggregated Values Across 10,000 Simulations

Model  X1 p-value  X1 S.E.  X2 p-value  X2 S.E.  X3 p-value  X3 S.E.  cor(X1,X2)  cor(X1,Y)  cor(X2,Y)
Uncorrelated-predictors Model 0.0010 0.2039 0.4954 0.2040 0.5000 0.2041 -0.0002 0.4456 -0.0009
Correlated-predictors Model 0.2376 0.6455 0.5010 0.6126 0.4997 0.2041 0.9483 0.4456 0.4229

Remember X1 was moderately correlated with Y at r = .45. Notice in the uncorrelated-predictors model (no multicollinearity) the average p-value across 10,000 simulations was .001, indicating it was able to pick out this correlation. But in the correlated-predictors model (with multicollinearity) the average p-value for the X1-Y relationship jumps to .2376, suggesting a much wider range of p-values across simulations. Indeed, the SD of p-values across the 10,000 simulations was .267, whereas it was just .009 in the uncorrelated-predictors model. You can also see the standard error of the X1 coefficient estimate increases by a factor of about 3. Just as theory would predict, with multicollinearity we are seeing more instability (less precision) in our estimates.
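If you’ve run the simulation, you can verify these spread figures directly from the stored p-values (a quick sketch assuming the two *_mod_dump data frames from above):

sd(uncorrelated_mod_dump$x1_pval)   # ~.009
sd(correlated_mod_dump$x1_pval)     # ~.267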

Interestingly, too, for X2 and X3 the mean p-values are about .5. This indicates their p-values are likely roughly uniformly distributed (the expected distribution under the null). For X3 this makes perfect sense: across both models X3 is truly uncorrelated with Y.

But what about X2 in the correlated model? Why would this p-value also be uniformly distributed? Remember in the correlated model, there is a true moderate relationship between X2 and Y (about r = .42). So we should see a large percentage of simulations detecting this relationship, yielding many p-values close to 0.

Shockingly, we don’t see anything near this. In fact, only 4.84% of simulations in this model detected a significant X2-Y relationship. This speaks volumes about the not-so-desirable effects of multicollinearity. Remember, in only 33.79% of simulations did the correlated-predictors model even pick up the X1-Y relationship. I hinted that maybe the remaining ~66% of cycles would detect a significant X2-Y relationship. But nope. Just 4.84%!
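Again, you can check this yourself from the stored simulation results (assuming the objects from the loop above):

sum(correlated_mod_dump$x2_pval <= .05) / reps     # ~.05, despite a real X2-Y correlation of ~.42
sum(uncorrelated_mod_dump$x2_pval <= .05) / reps   # X2 is truly unrelated to Y here, so also ~.05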

Overall this suggests that with multicollinearity the estimates of X1 and X2 are not just unstable with regard to formulating a consistent model for predicting Y, but are in fact causing a decline in our linear regression model’s performance. Indeed, most often the estimates do not reflect that any relationship exists between the predictors and Y (despite there being two moderate relationships in the data). In addition, we see a huge spike in the variability of our estimates, with the standard errors of both X1 and X2 increasing by a factor of about 3.

Conclusions of the simulation

Overall, the model with multicollinearity almost completely missed the X2-Y relationship (picking it up only about 5% of the time, roughly the Type I error rate). And more importantly it missed the X1-Y relationship an abysmal ~66% of the time. This is because, compared to the linearly independent model, it loses precision on the X1 and X2 estimates by a factor of about 3. Overall these results are consistent with what theory and folk wisdom about multicollinearity would suggest. The simulation showed relative instability in the slope coefficient estimates and worse performance with (vs. without) multicollinearity.

Solutions

While you may be looking for a solution to multicollinearity, the reality is there aren’t any guaranteed 100% correct ways to handle it. The most common solution is to just remove all but one predictor in the set of correlated predictors. Which predictor you decide to keep depends on your particular problem domain, the theory you are working with, the ease of collecting data on that predictor, etc. Another solution I recommend is to simply collect more data. The standard errors will still be poorer with (vs without) multicollinearity, but as with almost all data work, your models should give you appreciably better performance with larger sample sizes.

A final note

As a concluding note, I want to emphasize, of course, that the effects of multicollinearity are related to the size of the interpredictor correlation. This simulation was run with X1 and X2 correlated at .95.

So as a final experiment, I re-ran the 10,000-cycle simulation at 40 different levels of correlation between X1 and X2. In the two graphs that follow, you can see how the effects of multicollinearity differ at varying levels of the X1-X2 linear dependency.
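For reference, one common way to generate a predictor with a chosen level of linear dependence on X1 is sketched below. This is not necessarily the exact construction used for the graphs, but it produces approximately the target correlation r:

make_correlated <- function(x, r) {
  # mix a standardized copy of x with independent noise so cor(x, result) is roughly r
  r * scale(x)[, 1] + sqrt(1 - r^2) * rnorm(length(x))
}
x1 <- rnorm(100, 2, 3)
cor(x1, make_correlated(x1, 0.8))   # should be close to 0.8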

[Figure: X1 coefficient standard error at each X1-X2 correlation level, with the uncorrelated-model baseline shown as a red dot]

You can see from the graph above that the effects of multicollinearity are quite drastic when X1 and X2 are strongly correlated (.8 or above), but become increasingly small below that. That said, it is worth noting that the minimum SE in this simulation (at r = .15) was still larger (SE = .2062) than that obtained in the uncorrelated model (SE = .2039; pictured as the red dot). While this difference seems objectively small, it could matter in applications at scale.

In the graph below, I have plotted the proportion of cycles in which the X1-Y relationship was determined to be statistically significant across varying degrees of multicollinearity. The red vertical line represents the uncorrelated-model baseline percentage (.9965). You can see the same general pattern as above: big effects of multicollinearity at r = .8 or higher that drop off appreciably below that.

[Figure: proportion of simulations detecting the X1-Y effect at each X1-X2 correlation level, with the uncorrelated-model baseline marked in red]

Overall these results are consistent with theory and common folk wisdom of multicollinearity. I hope you found this blogpost informative. Feel free to tinker with the code and adjust the correlation strengths, sample sizes, etc. to test it out for yourself.

Let me know in the comments what you think! And please share any experiences that you may have had with multicollinearity!

Decision Tree Overview: what they are, how they work, when they are useful, and why I like them

One of the most invigorating aspects of the data science community is just the sheer number of available algorithms that you can add to your analytics toolkit. With all of these techniques at your disposal, you gain much flexibility in your data work and you can be confident that you will use an appropriate method when the scenario calls for it.

If your background is like mine (a more traditional inferential statistics education), you’ve likely encountered just a fraction of possible ways to model your data. But rest assured, if you’ve already learned the basic fundamentals of statistics, you should be able to ease your way into more complicated methods fairly seamlessly.

To start, it’s important to acknowledge the most common statistical techniques are just extensions of the general linear model (GLM). These include logistic regression, linear regression, and ANOVA (which itself is just a special case of linear regression). These methods are popular and for good reason; when a relationship between X and Y is truly linear (which it often is), they tend to perform extremely well.

But there are many situations in which X and Y are related, but in a non-linear way. What is there to do in such cases? Well, one technique is to simply modify the terms of the linear regression equation (using splines, segmenting the regression, adding polynomial terms, including interaction terms, etc.). But these models often end up being convoluted and lacking in interpretability (it’s not uncommon to catch yourself saying things like, “this model shows there is a 1.8-unit increase in the logged Y for every 1-unit increase in the quadratic term of X”; NO ONE wants to hear this except for stats geeks!). And compounding the problem, when multiple predictors are in the equation, the interpretation of that one coefficient gets even more complicated to explain.

Lucky for us, there is an algorithm that often outperforms the GLM, can deal with non-linearity, and also has the added benefit of being highly interpretable. I introduce to you, the decision tree.

What is a decision tree?

A decision tree is a purely data-driven (non-parametric) model that one can use for both regression and classification. It works by partitioning the data into nested subsets using a sequence of binary splits. The algorithm declares a root node attribute (at the top of the ‘tree’) and then progresses downward, making binary split decisions for how to subset the data until it is all optimally partitioned. At this point the splitting stops and the data resides in a terminal node (sometimes called a ‘leaf node’). The resulting structure is tree-like (really an inverted tree), with the root node at the top, decision nodes in the middle, and the leaf nodes at the bottom.

In the example above, the objective is to classify whether a customer will purchase a computer or not. The predictor space is three-dimensional (age, credit rating, and student status). In this case, all predictors are nominal, but decision trees can also work with continuous predictors. After a decision tree is built, one simply feeds some new data into the tree, and the algorithm will trickle the data down until it lands in a leaf node. At this point, the computed prediction will be the modal or mean response of that particular leaf node. In the tree above, if a customer walks in who is a senior and has an excellent credit rating, you would predict that this customer is likely to purchase a computer.

How does a decision tree decide where to split?

So now that we know what a decision tree is, it’s natural to wonder how the tree is built and how all of this splitting actually works. In the example above you may also be wondering what determined age to be the root node and not credit rating or student status. Though specific details may vary depending on your implementation (there are a few), in essence, all trees will choose an attribute and split criterion that partitions the data into maximally homogeneous segments. Basically, at each step, the algorithm will consider every possible variable to split on and every possible ‘cutpoint’ value within the range of that variable. And the particular split criterion the algorithm chooses will produce a subset of data that is the most homogeneous.

How is homogeneity measured? Well, it depends on the structure of your variables. In the example above, the predictors are all nominal variables and the response is also a nominal variable (purchase vs. not). In such cases, a common splitting criterion is the value that minimizes entropy or, analogously, maximizes information gain (both of which you can read about here). Another algorithm that uses this basic idea (the CHAID algorithm) will choose an attribute split that maximizes the test statistic in a chi-squared test of independence; a high test statistic here indicates the attribute and its splitting criterion are doing a good job of separating purchasers from non-purchasers. In the case of regression, the process is similar, but instead of a homogeneity measure, the algorithm will compare predicted to actual responses, square the differences, and then choose the split that produces the greatest reduction in the sum of squared residuals (RSS).
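As a concrete illustration of the classification case, here is a tiny sketch of the entropy calculation for a set of class labels (lower entropy means a more homogeneous node):

entropy <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions in the node
  -sum(p * log2(p))
}
entropy(c("buy", "buy", "no", "no"))    # 1 bit: a maximally mixed node
entropy(c("buy", "buy", "buy", "buy"))  # 0 bits: a perfectly homogeneous node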

Whichever criterion the algorithm uses (chi-square, RSS, entropy/information gain, etc.), the process continues recursively until there are no more splits. So after the tree makes its first split, it partitions the two resulting subsets of data again by splitting on some other value within each subset, and so on. The stopping criterion that determines when to stop splitting is often set by the user, and can include metrics such as how many observations are assigned to each terminal node, how many splits are made, etc.

How do decision trees compare to linear models?

When a set of X’s is truly related to Y in some non-linear way, your standard least squares regression line will be biased (meaning a straight line doesn’t approximate the true nature of the relationship in the data). In such cases you can go ahead and fit a linear regression anyway. It may even produce a respectable p-value and explain a significant portion of the overall variability in Y. But even in such cases, when the relationship is non-linear, decision trees will typically outperform the linear model.

So, decision trees tend to perform better than linear models when there is a true non-linear relationship between X and Y. Another big advantage over the linear model is in its ease of interpretation. Though many linear regressions are fairly interpretable (it measures change in Y in units of X), things get much fuzzier when you start dealing with multiple predictors, interaction terms, polynomials, etc. In contrast, decision trees are extremely straightforward and can be readily understood by almost anybody (this is especially useful for communicating with those who may lack the proper training in analytics).

So picking up non-linearities and also ease of interpretation are the two main advantages of decision trees. Like with all algorithms, however, there is no free lunch. One disadvantage with decision trees is that they tend to model the signal so well that they actually often pick up a lot of the noise (this is called ‘overfitting’). Remember, there is always a random component in our data, and so really specific trees that can pick up peculiarities of the data structure may be modeling some of that random component and treating it like a systematic pattern. Luckily there are many methods to handle this pitfall of decision trees (ensemble methods, bagging, random forests, etc.). We won’t focus on these in this blogpost, but just know that there are ways to treat this scenario with decision trees.

How do I get started with decision trees?

Now that you have learned about decision trees, the best way to get better at using them is to go test them out! There are many great packages available in R and Python to get you started. I would personally recommend the rpart package in R. You can run it on the iris dataset in R and have a decision tree built in minutes!
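For example, a minimal sketch with rpart and the built-in iris data might look like this (base graphics shown here; packages like rpart.plot make prettier trees):

library(rpart)
iris_tree <- rpart(Species ~ ., data = iris)   # classify species from the four measurements
print(iris_tree)                               # text view of the splits
plot(iris_tree, margin = 0.1)
text(iris_tree, use.n = TRUE)                  # label nodes with class counts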

In a future blog I will do a tutorial in R for building a decision tree and visualizing the results. It’s really quite simple. In the meantime, I welcome you to comment with any thoughts or questions!

The R Advantage – Tip #1

This is the first entry in a series (The R Advantage) that will spotlight useful R tips that I’ve picked up along the way. The audience is anyone from the absolute beginner who may be on the fence about switching to R, to the person who is just starting out in R and learning simple commands, to the more intermediate user who is looking to expand his or her R knowledge.

These tips are simple to execute, require no deep coding knowledge, and pay dividends in saving you time and generally making your life easier.

I’ll note that one thing I really would have appreciated when I was just starting R was to have a step-by-step, intuitive explanation of what code was doing at each line (so I will do this to make things clear for the beginners).

Without further ado, today’s R tip is: how to merge files that have unique observations (rows), but the same variable structure (columns).

Problem scenario: You have just run an experiment in a lab. Say you have collected reaction time data from participants, and your study produces 1 data file per participant. Come time for analysis, you have to somehow aggregate all of the data into a single file.

This problem frequently comes up in the social sciences (but probably shows up in many other fields too). And in fact, this is the first problem that led me to want to switch out of doing analysis with GUI’s like SPSS and to get further into unlocking the efficiency of R.

Before I knew how to use R, I would sometimes spend hours doing this aggregation process manually (opening each file, copying and pasting the data into a parent file, closing the child file, and then doing it again). It was exhausting, too mindless for me to enjoy, and was ultimately taking hours out of my day.

Solution: let R iterate through each file in your file directory, extract the contents, and build the central file for you.

Benefit: It literally saves you hours of time.

So here is the first chunk of code that personally convinced me of The R Advantage.

Note: this code assumes each of your data files are in the same folder (and are the only files in that folder). It also assumes the variable names across files are the same.


files_full <- list.files(getwd(), full.names=TRUE)
df <- data.frame() 

for (i in 1:length(files_full)) {
        currentFile <- read.csv(files_full[i])
        df <- rbind(df, currentFile)
}

write.csv(df, "merged_data.csv")

Line 1: Lists all files in the current working directory (with their full paths) and stores them in a variable called ‘files_full’

Line 2: Initializes an empty data frame (we will use this to build our central data file)

Line 4: Starts a for loop (a structure that ticks through a predetermined number of times – in this case, the number of files in the directory – and performs an operation on each iteration). Line 5 and 6 detail the operation.

Line 5: Reads in the data from the current participant’s file and stores it in a variable called ‘currentFile’

Line 6: Updates the central data frame (df) with the current participant’s file contents

Line 9: Writes and saves the central data frame to a merged data file in .csv format. 
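As a side note, once you’re comfortable with the loop version, a slightly more idiomatic variant reads all the files first and stacks them in one go. This sketch makes the same assumptions about the folder; the pattern argument restricting to .csv files is an extra safeguard I’ve added:

files_full <- list.files(getwd(), pattern = "\\.csv$", full.names = TRUE)
df <- do.call(rbind, lapply(files_full, read.csv))   # read each file, then bind the rows once
write.csv(df, "merged_data.csv", row.names = FALSE)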

That’s all for this entry in The R Advantage. I hope that you can see with just a few lines of code you can gain hours of time! I know you will want to try it out now. If you do, feel free to drop me a line in the comments and let me know how it works!

Poster accepted at INFORMS 2016

In November I’m headed to Nashville to present at the annual conference for the Institute for Operations Research and the Management Sciences (INFORMS). To preview the models I built and will be presenting: I used logistic and linear regression to make MLB predictions for 2016 (total wins and playoff status for both the Chicago Cubs and the White Sox). My next step is to build a web scraper to gather data and perhaps do some agent-based modeling to enable day-by-day predictions for individual matchups (in addition to predicting season-long trends).

I’m looking forward to presenting the initial data and of course meeting all the great people there!

A Bayesian Approach to the 2016 NCAA Tournament

My friend Chris and I decided to enter a Kaggle competition for the 2016 NCAA Tournament. March Madness is notoriously challenging to predict from a statistical point of view (see my alma mater MSU, a #2 seed, who earlier today lost in the first round to Middle Tennessee, a #15 seed). No one saw that coming. Such is the nature of March Madness.

But barring those rare upset cases, we are actually pretty confident in our predictions, or at least the basis for them. We used a Bayesian technique to generate probabilities of each team winning, based on a latent ability variable (comprised of many “black-boxed” statistics). Our goal is to outperform a baseline model of simply choosing the higher-ranked team each round. You can see our predictions and read more about our model in the shared Dropbox folder below.

https://www.dropbox.com/sh/t4j1knw4706iixm/AAAjgEf_I68tx0oAiGEPKDdZa?dl=0

The 6 Principles of Effective Data Visualizations: Principle #1 – Show Comparisons, Contrasts, Differences

At their core, statistical analyses always involve making a comparison of something. We might want to know how voter turnout differs between Democrats and Republicans. Or scientists might test the effects of a new treatment by comparing an experimental group to a control group. We might just want to see how global temperature has changed over time. There are infinite examples that we could think up, but the key here is that analytics involves making comparisons of some sort.

It makes sense, then, that Principle #1 in making effective data visualizations is to Show Comparisons, Contrasts, Differences (in an interesting and appropriate way). A good display will show how two or more things (groups of people, treatment levels, years, states, etc.) compare to one another. Not only that, but our visualizations should draw out this comparison in a way that is engaging to the reader. Republicans might be better represented by the color red, for example, and Democrats in blue. Or if only one group differs from a number of other groups in some meaningful way, it would be sensible to draw attention to that one group with a visual representation of anomaly (perhaps by coloring its line on a scatterplot, while keeping all the other lines black). So, in sum, a good visualization will not only represent a comparison with the data, but it must do so in an interesting and sensible way.


Take a look at this example of a poor visualization. I got this from the site WTF Visualizations (which can help you learn what not to do in data science, or can at least give you a few laughs!). Anyway, this violates our first principle in a number of ways.

[Image: vitamin B3 infographic from WTF Visualizations]

First, there does not seem to be any comparison being made in the display. The visualization is simply listing the functions and benefits of vitamin B3. No comparisons, no contrasts, nothing. This makes for really boring content to visualize right from the get go. Even the slickest graphs in the world would not make this “story” (if you can call it that) interesting.

Not only that, but much to the reader’s confusion, parts of this visualization make it seem like there is a comparison being made (where there is not). For example, two of the statements hint at having an empirical basis (“significant decrease in heart disease” and “reduction in the risk of Alzheimers” both use terminology that sounds like statistical findings from a research publication), whereas the other two statements do not (“Assists in the energy production of cells” and “promotes healthy skin”, compared to the first two, have a more broad and “conjecture”-like format). This makes it unclear to the reader why there is a difference in phrasing at all for these statements. And adding to the confusion, these two sets of statements also differ in that one begins with verbs (“assists” and “promotes”), whereas the other begins with nouns (“reduction” and “decrease”). In both these examples of inconsistency, the visualization misleads the reader into thinking there might be a comparison being made between sets of statements (when, in fact, there is not). This is not only annoying to the reader, but costs the reader energy and time. And always remember: No good visualization should make the reader do more work than is necessary. 

The second major flaw in this visualization (which, to me, is actually more bothersome than the lack of comparisons) is that the designers use a Venn-diagram style to present the four statements, indicating conceptual overlap. Yet there is no such overlap (nor would we expect any in an analysis that simply lists a few facts). In this case, if the designers were going to stick with the bland content, they should at least have chosen a more appropriate way to represent their facts. I actually think a simple list format (albeit extremely boring) would be more suitable here. There are many, many good ways to format these statements; four overlapping circles is not one of them. Leave that for scenarios in which four statements do overlap in some meaningful way.

The one and only area in which this visualization might be said to succeed is in the choice of using coffee (a food source that is high in vitamin B3) to represent the function of vitamin B3. I find this to be an appropriate use of imagery. And I also like that it separates the function of vitamin B3 from its benefits (by presenting the coffee mug as distinct from the three transparent circles). All that said, in my opinion, these points of strength don’t outweigh the major flaws of this visualization.

Remember our Principle #1 states that effective visualizations 1) make a comparison, and 2) represent that comparison in an appropriate and interesting way. The above visualization does not meet either of these criteria and is therefore, a poor visualization.


So far we’ve seen how not to demonstrate Principle #1. In the next entry we will turn briefly to a visualization that is a good demonstration of our principle. Stay tuned!

 

2016 MLB Predictions – Chicago Teams

Enjoy the analysis! I had fun learning how to do this, and then doing it!

Introduction

Sabermetrics is the empirical analysis of baseball. Baseball has always been a game of statistics. But with the explosion of data in the past 7 years, in conjunction with the development of more advanced analytics techniques, baseball teams now – more than ever – can leverage their numbers in increasingly sophisticated ways. There is so much data available that even self-taught lay people – with the proper tools at their disposal – can be a part of the action.

Source: http://www.datanami.com/2014/10/24/todays-baseball-analytics-make-moneyball-look-like-childs-play/

In this report, I use linear regression to predict a number of outcomes for both the Chicago Cubs and the Chicago White Sox in 2016. Extrapolating from just a few linear models, I project that:

  • The Chicago White Sox will win 87 games in 2016 and will NOT make the playoffs.
  • The Chicago Cubs will win 108 games in 2016 and WILL make the playoffs.

Loading the Data

First, let’s load in a dataset of MLB statistics from 1962-2015.

baseball <- read.csv("baseball.csv")
str(baseball)
## 'data.frame':    1322 obs. of  15 variables:
##  $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 2 3 4 5 7 8 9 10 11 12 ...
##  $ League      : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
##  $ Year        : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ RS          : int  720 573 713 748 689 622 640 669 737 689 ...
##  $ RA          : int  713 760 693 753 608 701 754 640 844 803 ...
##  $ W           : int  79 67 81 78 97 76 64 81 68 74 ...
##  $ OBP         : num  0.324 0.314 0.307 0.325 0.321 0.306 0.312 0.325 0.315 0.328 ...
##  $ SLG         : num  0.414 0.359 0.421 0.415 0.398 0.38 0.394 0.401 0.432 0.42 ...
##  $ BA          : num  0.264 0.251 0.25 0.265 0.244 0.25 0.248 0.256 0.265 0.27 ...
##  $ Playoffs    : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ RankSeason  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ RankPlayoffs: int  NA NA NA NA 5 NA NA NA NA NA ...
##  $ G           : int  162 162 162 162 162 162 162 162 162 162 ...
##  $ OOBP        : num  0.322 0.339 0.321 0.327 0.29 0.322 0.328 0.297 0.353 0.33 ...
##  $ OSLG        : num  0.423 0.426 0.418 0.424 0.372 0.41 0.42 0.386 0.461 0.444 ...

The dataset contains many variables related to historical team performance. The table below describes each variable:

Variable Name Variable Description
Team MLB team abbreviations (e.g. ARI = Arizona Diamondbacks)
League American League (AL) or National League (NL)
Year Year
RS Runs scored by team’s batters
RA Runs allowed by team’s pitchers
W Total wins in regular season
OBP On-base percentage: a measure of how often a batter reaches base
SLG Slugging percentage: a measure of the power of a hitter (total bases/total at bats)
BA Batting average: a measure of the average performance of a batter (hits/total at bats)
Playoffs Dichotomous variable representing whether or not a team made the playoffs
RankSeason How each team ranked overall in the regular season
RankPlayoffs How each playoff team ranked in the playoffs
G Total games played
OOBP Opponent’s OBP: a measure of how often a team lets the opponents reach base
OSLG Opponent’s SLG: a measure of how far along the bases a team allows the opponent to get

Framing the Analysis

The primary goal of this analysis is to determine if either of the Chicago baseball teams will make it to the playoffs in 2016. To do this, it would be helpful to know how many regular season games a team must typically win to get to the playoffs.

Making it to the playoffs

The graph below shows that historically, if a team wins 94 or more games, they have a strong chance of making it to the playoffs.

[Figure: historical regular-season wins vs. playoff appearances]


Winning 94 games

So how does a team win games?

A baseball win happens when you outscore your opponent. We will create a variable, “RD” (runs differential) to represent the extent to which a team outscores their opponent. RD is simply calculated by subtracting a team’s runs allowed from their runs scored.

baseball$RD <- baseball$RS-baseball$RA

Indeed, historically there is a strong relationship between RD and wins.

The graph shows a clear linear pattern. But let’s actually put some numbers on this relationship.

Building a Model

We will build a linear model to predict wins with runs differential.

winsReg <- lm(W ~ RD, data = baseball)
summary(winsReg)
## 
## Call:
## lm(formula = W ~ RD, data = baseball)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4023  -2.7977   0.0735   2.8746  12.7747 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 80.910741   0.110676  731.06 <0.0000000000000002 ***
## RD           0.104284   0.001082   96.39 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.024 on 1320 degrees of freedom
## Multiple R-squared:  0.8756, Adjusted R-squared:  0.8755 
## F-statistic:  9290 on 1 and 1320 DF,  p-value: < 0.00000000000000022

The model summary indicates there is a strong linear relationship between runs differential and wins. The p-value is highly significant, and the model produced an R^2 value that would typically be considered high.

Using the Model

So now we’ve seen with both a chart and a model summary that RD is a good predictor of regular season wins. One useful application of the model summary is that now we can compute just how big the run differential has to be for a team to have a strong chance of winning 94 games and then making it to the playoffs.

Remember our summary showed us that:

  • Wins = 80.910741 + 0.104284(RD)

Thus, for a team to have a strong chance of making it to the playoffs, the value of (80.910741 + 0.104284*RD) must be greater than or equal to 94.

Do some simple arithmetic:

thresholdRD <- (94-80.910741)/0.104284
thresholdRD
## [1] 125.5155

To win 94 games and have a strong chance of making the playoffs, a team must outscore their opponents by about 126 runs over the regular season.



Doing the Analysis

Now that we have a couple of key figures in hand, let’s proceed with the actual analysis.

Remember, we want to determine if either of the Chicago baseball teams will make it to the playoffs this year. We know that if a team’s RD is at least 126, they will probably win 94 games and get to go to the playoffs.

So how can we predict the RD for each team in 2016? We will predict how many runs each team will score in 2016 (this can be done with batting statistics) and then how many runs each team will allow in 2016 (this can be done with pitching statistics).


Building a Model for Runs Scored

Runs scored (RS) can be predicted by OBP (on base %, includes walks) and SLG (how far a player gets around the bases at his at-bat; this measures a batter’s power).

Many teams used to focus on batting average. However OBP is well-known to be the more important predictor. Baseball analysts also consider SLG to be important.

Let’s confirm this with a regression model.

runsReg <- lm(RS ~ OBP + SLG + BA, data = baseball)
summary(runsReg)
## 
## Call:
## lm(formula = RS ~ OBP + SLG + BA, data = baseball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.400 -16.779  -1.008  16.935  93.445 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  -811.41      16.66 -48.707 <0.0000000000000002 ***
## OBP          2941.02      94.69  31.061 <0.0000000000000002 ***
## SLG          1525.85      36.58  41.708 <0.0000000000000002 ***
## BA           -154.84     111.65  -1.387               0.166    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.42 on 1318 degrees of freedom
## Multiple R-squared:  0.9212, Adjusted R-squared:  0.921 
## F-statistic:  5137 on 3 and 1318 DF,  p-value: < 0.00000000000000022

Note: Both OBP and SLG are significant predictors of runs scored. BA is not. Moreover, note that the BA coefficient is negative, implying that, holding OBP and SLG constant, a lower BA goes with more runs scored. This obviously makes no sense. In statistics, when this happens, it usually means there is multicollinearity in the data (each coefficient reflects the unique variance accounted for by that predictor; because OBP, SLG, and BA are all highly correlated with one another, once OBP and SLG have accounted for their portions of the variability, there is very little left over for BA to explain). Thus, for this reason, we can simplify our model by removing BA.

Now we’ll re-run the model with only OBP and SLG as predictors.

runsReg2 <- lm(RS ~ OBP + SLG, data = baseball)
summary(runsReg2)
## 
## Call:
## lm(formula = RS ~ OBP + SLG, data = baseball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.871 -16.777  -1.116  16.998  93.008 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  -817.51      16.07  -50.86 <0.0000000000000002 ***
## OBP          2859.24      74.11   38.58 <0.0000000000000002 ***
## SLG          1507.33      34.07   44.24 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.43 on 1319 degrees of freedom
## Multiple R-squared:  0.9211, Adjusted R-squared:  0.921 
## F-statistic:  7700 on 2 and 1319 DF,  p-value: < 0.00000000000000022

The new model is much better. With only two predictors, it is more parsimonious, and it has almost the exact same R^2 value as the previous model.

Looking at the coefficients for this model, OBP is almost twice as large as SLG. Because they are on the same scale, this tells us that OBP is a better predictor of runs scored than SLG (OBP has about twice the predictive power of SLG).

In sum, BA is overvalued, SLG is important, but OBP is more important.


Building a Model for Runs Allowed

Runs allowed (RA) can be predicted by OOBP (opponent’s on base % allowed by the pitchers) and OSLG (opponent’s SLG allowed by the pitchers).

Let’s build a model to predict RA with OOBP and OSLG.

ramodel <- lm(RA ~ OOBP + OSLG, data = baseball)
summary(ramodel)
## 
## Call:
## lm(formula = RA ~ OOBP + OSLG, data = baseball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.517 -17.954   0.902  18.424  71.930 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  -851.48      24.63  -34.57 <0.0000000000000002 ***
## OOBP         2777.06     138.80   20.01 <0.0000000000000002 ***
## OSLG         1639.65      81.13   20.21 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.68 on 507 degrees of freedom
##   (812 observations deleted due to missingness)
## Multiple R-squared:  0.9142, Adjusted R-squared:  0.9139 
## F-statistic:  2702 on 2 and 507 DF,  p-value: < 0.00000000000000022

This model is strong; both OOBP and OSLG are significantly related to RA.


2016 Predictions

So far we have established a number of key facts in this analysis:

  • If a team wins 94 games they will probably make it to the playoffs
  • If a team outscores their opponents by 126 runs (RD >= 126), they have a strong chance of meeting the 94-game threshold.
  • A team’s RD is predicted by RS and RA.
    • RS is predicted by OBP and SLG.
    • RA is predicted by OOBP and OSLG.

Given these figures, we will obtain projected OBP, SLG, OOBP, and OSLG for each Chicago team in 2016, plug those projections into our models, and use the results to predict how many games each Chicago team will win in 2016.

Methodology: 2016 Batting Figures

These numbers were available on the internet.

First, I obtained the projected 2016 roster for both the White Sox and the Cubs.

Then, for each player, I computed composite figures for OBP and SLG using the results of 4 different sabermetrics projection systems.

By using multiple systems, I wanted to lessen the impact that any one system’s error had on the overall prediction.

Each system is slightly different (you can read more about their own methodologies by clicking the links above), but all of them involve some kind of weighted average of the previous 3 years’ statistics.

Methodology: 2016 Pitching Figures

I obtained the projected 2016 lineup and bullpen for both the White Sox and the Cubs.

Unfortunately, unlike the batting statistics, the 4 available sabermetrics systems did NOT include projections for OOBP and OSLG.

To make these projections, I simply used a weighted average of the previous 3 years’ OOBP and OSLG data for each pitcher in both the lineup and the bullpen, and then weighted those figures based on how much playing time each pitcher gets.

Weighted Average Based on Past 3 Years

The weights I used were generated in a previous sabermetrics analysis, which showed, for a given player, how predictive each of his previous seasons’ statistics was of his current-season figures. The authors came up with weights of .47, .32, and .18 for the three previous years, respectively (most recent year weighted most heavily).

Note: 3 pitchers in my analysis only had 2 seasons worth of data. In these cases, I used weights of 0.6 and 0.4 (roughly equivalent to the ratio of (.47/.32)). For players with only 1 year’s worth of data, I weighted that data at 1. I excluded all rookie players from the analysis.
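For illustration, the 3-year weighting for a single pitcher looks like this (the OOBP values are made up; only the weights come from the analysis above):

oobp_last3 <- c(0.310, 0.322, 0.345)            # hypothetical OOBP, most recent season first
yr_weights <- c(0.47, 0.32, 0.18)
sum(yr_weights * oobp_last3) / sum(yr_weights)  # this pitcher's weighted OOBP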

Weighted Average Based on Playing Time

After generating each pitcher’s weighted OOBP and OSLG, I then weighted each of those figures based on how much playing time a pitcher gets (to do this, I computed the proportion of batters faced that each pitcher contributed to the team’s total batters faced in the 2015 season). My rationale was that the pitchers who face more batters, such as starting pitchers, should be weighted more heavily.
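And here is a sketch of the playing-time weighting for a hypothetical three-pitcher staff (all numbers invented for illustration):

pitcher_oobp  <- c(0.305, 0.330, 0.318)                  # each pitcher's weighted OOBP
batters_faced <- c(900, 450, 250)                        # batters faced in 2015
sum(pitcher_oobp * batters_faced / sum(batters_faced))   # team-level projected OOBP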



2016 Predicted Figures

Using these methodologies, I obtained the following figures:

Team Statistic Projected Figure
White Sox OBP 0.321666667
White Sox SLG 0.41475
Cubs OBP 0.34253125
Cubs SLG 0.43878125
White Sox OOBP 0.320957143
White Sox OSLG 0.38205
Cubs OOBP 0.29843875
Cubs OSLG 0.353745

It appears that in 2016 the Chicago Cubs will perform better in both batting and pitching than the Chicago White Sox.


Predicting Outcomes for the Chicago White Sox

But what does this mean for the White Sox in terms of making the playoffs?

Remember: if we can predict the runs scored and runs allowed for a team, we can also predict how many games they will win.

Using the runs scored linear regression model (the one that uses OBP and SLG as independent variables), we can find the number of runs we expect the White Sox to score:

summary(runsReg2)
## 
## Call:
## lm(formula = RS ~ OBP + SLG, data = baseball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.871 -16.777  -1.116  16.998  93.008 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  -817.51      16.07  -50.86 <0.0000000000000002 ***
## OBP          2859.24      74.11   38.58 <0.0000000000000002 ***
## SLG          1507.33      34.07   44.24 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.43 on 1319 degrees of freedom
## Multiple R-squared:  0.9211, Adjusted R-squared:  0.921 
## F-statistic:  7700 on 2 and 1319 DF,  p-value: < 0.00000000000000022
obp_chw_2016 <- 0.321666667 
slg_chw_2016 <- 0.41475
runspredicted_chw_2016 <- -817.51 + 2859.24*obp_chw_2016 + 1507.33*slg_chw_2016
runspredicted_chw_2016
## [1] 727.3773

The White Sox are predicted to score ~727 runs in 2016.
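
As an aside, rather than typing the rounded coefficients by hand, we could let R do this arithmetic with predict(), assuming the fitted model object runsReg2 is still in the workspace. The same approach works for the runs-allowed model (ramodel) used below.

# Equivalent prediction straight from the fitted model object
predict(runsReg2, newdata = data.frame(OBP = obp_chw_2016, SLG = slg_chw_2016))
# should land very close to the ~727 figure above; any tiny difference reflects coefficient rounding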

Now we can also find the predicted runs allowed, using the runs allowed regression. Remember, this uses opponents' OBP (OOBP) and opponents' SLG (OSLG).

summary(ramodel)
## 
## Call:
## lm(formula = RA ~ OOBP + OSLG, data = baseball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.517 -17.954   0.902  18.424  71.930 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  -851.48      24.63  -34.57 <0.0000000000000002 ***
## OOBP         2777.06     138.80   20.01 <0.0000000000000002 ***
## OSLG         1639.65      81.13   20.21 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.68 on 507 degrees of freedom
##   (812 observations deleted due to missingness)
## Multiple R-squared:  0.9142, Adjusted R-squared:  0.9139 
## F-statistic:  2702 on 2 and 507 DF,  p-value: < 0.00000000000000022
oobp_chw_2016 <- 0.320957143
oslg_chw_2016 <- 0.38205
runsAllowedPredicted_chw_2016 <- -851.48 + 2777.06*oobp_chw_2016 + 1639.65*oslg_chw_2016
runsAllowedPredicted_chw_2016
## [1] 666.2655

The White Sox are predicted to allow their opponents to score ~666 runs in 2016.

Now we can calculate the predicted difference between runs scored and runs allowed.

predictedRD_chw <- runspredicted_chw_2016-runsAllowedPredicted_chw_2016
predictedRD_chw
## [1] 61.11179

How many games will the White Sox win? We can use the run differential regression to predict their 2016 wins.

# Plug the predicted run differential into the wins regression from earlier
wins_chw <- 80.910741 + 0.104284*predictedRD_chw
wins_chw
## [1] 87.28372

Thus, the Chicago White Sox are predicted to win 87 games (<94), meaning they will probably NOT make the playoffs in 2016.


Predicting Outcomes for the Chicago Cubs

Using the same runs scored linear regression model, we can find the number of runs we expect the Cubs to score:

summary(runsReg2)
## 
## Call:
## lm(formula = RS ~ OBP + SLG, data = baseball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.871 -16.777  -1.116  16.998  93.008 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  -817.51      16.07  -50.86 <0.0000000000000002 ***
## OBP          2859.24      74.11   38.58 <0.0000000000000002 ***
## SLG          1507.33      34.07   44.24 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.43 on 1319 degrees of freedom
## Multiple R-squared:  0.9211, Adjusted R-squared:  0.921 
## F-statistic:  7700 on 2 and 1319 DF,  p-value: < 0.00000000000000022
obp_chc_2016 <- 0.34253125
slg_chc_2016 <- 0.43878125
runspredicted_chc_2016 <- -817.51 + 2859.24*obp_chc_2016 + 1507.33*slg_chc_2016
runspredicted_chc_2016
## [1] 823.2572

The Cubs are predicted to score ~823 runs in 2016.

Now we will use the runs allowed regression again to predict how many runs the Cubs will allow.

summary(ramodel)
## 
## Call:
## lm(formula = RA ~ OOBP + OSLG, data = baseball)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.517 -17.954   0.902  18.424  71.930 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  -851.48      24.63  -34.57 <0.0000000000000002 ***
## OOBP         2777.06     138.80   20.01 <0.0000000000000002 ***
## OSLG         1639.65      81.13   20.21 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.68 on 507 degrees of freedom
##   (812 observations deleted due to missingness)
## Multiple R-squared:  0.9142, Adjusted R-squared:  0.9139 
## F-statistic:  2702 on 2 and 507 DF,  p-value: < 0.00000000000000022
oobp_chc_2016 <- 0.29843875
oslg_chc_2016 <- 0.353745
runsAllowedPredicted_chc_2016 <- -851.48 + 2777.06*oobp_chc_2016 + 1639.65*oslg_chc_2016
runsAllowedPredicted_chc_2016
## [1] 557.3203

The Cubs are predicted to limit their opponents to only ~557 runs in 2016.

Now we will calculate the predicted difference between runs scored and runs allowed for the Cubs.

predictedRD_chc <- runspredicted_chc_2016-runsAllowedPredicted_chc_2016
predictedRD_chc
## [1] 265.9369

And with a run differential of ~266, how many games are the Cubs predicted to win?

wins_chc <- 80.910741 + 0.104284*predictedRD_chc
wins_chc
## [1] 108.6437

The Chicago Cubs are predicted to win 108 games (>94), meaning they WILL probably make the playoffs in 2016.

This run differential and win total are remarkably high. To put them in perspective, remember this graph:

This graph contains historical data for 50+ years of baseball. A run differential of ~266 would position the 2016 Cubs as one of the highest-performing teams of all time. It could be a great year for Cubs fans!


Summary

  • The Chicago White Sox are projected to win 87 games in 2016 and NOT make the playoffs.
  • The Chicago Cubs are projected to win 108 games in 2016 and TO make the playoffs. Moreover, there is a good chance the 2016 Cubs will have an outstanding season!

The 6 Principles of Quality Data Visualizations (Introduction)

By some estimates, 90% of the world’s data has been created in the past 2 years. Big data is here, and it is here to stay.

And along with this increase in the world’s data, there has been a commensurate increase in the demand to communicate it. Here is where data visualizations come in.

Data visualizations are simply visual representations of data: charts, maps, graphs, and the like. The primary reason to visualize data is to make your analyses more intuitive to your audience. A quality visualization can also make or break a presentation. I’ve seen it happen. Clunky graphs with hard-to-read labels can leave the audience confused, whereas a crisp, fluent display can stand out and become memorable. There is even some evidence that neuroscience findings are judged more believable when they are presented on a colorful image of a brain rather than shown only in a table of statistical results.


In a series of upcoming blog entries, I am going to examine 6 fundamental principles behind great data visualizations. These principles are based on Edward Tufte’s book Beautiful Evidence (which I – along with many others – highly recommend). If you follow these principles, you will be making outstanding visualizations in no time!

In the next entry, I will focus on the first of these 6 principles: namely, that a good visual display should show comparisons, contrasts, and/or differences.

In that entry, I will draw a bit on my background in perceptual psychology and cognition. I will first focus on the underlying reasons why this technique is effective, and then demonstrate how to use it to make effective visualizations.

Stay tuned!

My Second Data Visualization

This is my second visualization. I am particularly proud of the “story” I told with this data. It can get complicated, but I think it is interesting. The idea came to me after reading an opinion piece in the New York Times detailing the many benefits of increasing access to education for women. The focus of that article was on developing countries, but I thought perhaps there was already data on this topic from the U.S. It turns out there was. So I sought out the data, did a bit of exploratory analysis, found some interesting patterns, and turned it into a graph using R. Here is what I found:

visualization_gender_parity.png

At four different levels of education, this chart tracks the ratio of male to female average income over time and the ratio of males to females in the workforce. Thus, on the Y axis, the value of 1.0 represents parity (a 1:1 ratio of male to female).

As you can see, with the exception of those who only obtained a high school diploma, as the ratio of male to female graduates approaches parity, so too does the ratio of male to female income. Now, of course, one should be cautious in accepting correlational research for anything more than just correlations. The point of this chart is not to imply an underlying mechanism for the observed effect (e.g., that increasing access to education for women causes greater income parity by gender). It simply shows that these two phenomena co-occur.
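
For anyone curious how a chart like this might be put together in R, here is a rough ggplot2 sketch. The data frame, column names, and values below are hypothetical placeholders, not the actual data behind the figure above.

library(ggplot2)

# Hypothetical data: one row per year and education level, with two parity ratios
parity <- data.frame(
  year           = rep(c(1980, 1990, 2000, 2010), times = 2),
  education      = rep(c("High school", "Bachelor's"), each = 4),
  income_ratio   = c(1.80, 1.70, 1.65, 1.60, 1.90, 1.60, 1.40, 1.30),
  graduate_ratio = c(1.10, 1.05, 1.00, 0.95, 1.40, 1.10, 0.90, 0.80)
)

ggplot(parity, aes(x = year)) +
  geom_line(aes(y = income_ratio,   linetype = "Male:female income")) +
  geom_line(aes(y = graduate_ratio, linetype = "Male:female graduates")) +
  geom_hline(yintercept = 1.0) +   # 1.0 marks parity
  facet_wrap(~ education) +
  labs(y = "Male-to-female ratio", linetype = NULL)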

 

In my next post, I will give some tips on how to make a quality visualization. In short, it involves both a compelling story AND quality aesthetics. I will give some pointers about each of these components, to help you get the most out of your visualizations.