Linear regression is perhaps the most widely used statistical tool in data science. It is a fundamental building block for statistical inference and prediction, and many more advanced methods derive from simple linear regression. This blogpost will walk you through the basic concept of linear regression. It will also provide a code tutorial to perform a linear regression in R.
By the end of this blogpost you will know:
- What linear regression is
- How linear regression works
- How to perform linear regression in R and how to interpret its output
What is linear regression?
Linear regression is a statistical method that models the relationship between two continuous quantitative variables.
As you can see below, we are investigating the relationship between car weight and gas mileage. It appears that a downward sloping line describes this relationship. Generally, as car weight increases, gas mileage decreases.
Now, just eyeballing this graph, you can see that the line does a pretty good job of capturing the overall downward “trend.” However, it is also clearly an imperfect model of the data (as all models by definition are): the line passes through very few of the data points, with many points above it and many below. This means there is also some “scatter” or “error” about the line.
The goal of linear regression is to fit a line to the data that minimizes the scatter.
How does linear regression work?
Linear regression works by finding the line that best fits the data. Infinitely many lines could be drawn through the data, but the best-fitting line (called the “line of best fit”) is the one that minimizes scatter. The metric that quantifies scatter is called the sum of squared residuals (also known as the sum of squared errors, or SSE).
Each data point has a corresponding “residual,” which is just a quantity representing the vertical distance between it and the line of best fit. In the plot below, all the residuals are demarcated by black lines. You can see that each residual represents how far the data point is from its predicted point on the line. In other words, the residual represents how much ‘error’ that data point contributes to the model.
A goal of linear regression is to minimize the overall model error. This should make intuitive sense. The less error there is in a model, the better it is fitting to the data. Therefore, we want to treat both positive residuals (those above the line) and negative residuals (those below the line) as equally ‘bad’. So we square each residual value and sum them up. The line of best fit is the line that minimizes the sum of these squared residuals.
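To make this concrete, here is a small sketch in R (using the built-in mtcars data that the lab below works with; the variable names are mine). It computes the SSE for the least-squares line and for a nearby line with a nudged slope, showing that the fitted line really does have the smaller SSE:

```r
# Fit the regression and compute its sum of squared residuals (SSE)
mod <- lm(mpg ~ wt, data = mtcars)
sse_best <- sum(resid(mod)^2)

# Try a nearby line by nudging the slope away from the optimum
a <- coef(mod)[1]        # intercept of the best-fit line
b <- coef(mod)[2] + 0.5  # slope, deliberately perturbed
sse_other <- sum((mtcars$mpg - (a + b * mtcars$wt))^2)

sse_best < sse_other     # TRUE: no other line beats the least-squares fit
```

Nudging the slope (or the intercept) in any direction will always produce a larger SSE, which is exactly what “line of best fit” means.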
You can see the line of best fit above for estimating gas mileage (in mpg) is calculated to be: mpg = 37.3 - 5.34 * weight (in 1000s of pounds).
This equation tells us that any car’s gas mileage (mpg) is predicted to be 37.3 minus 5.34 times its weight in thousands of pounds. So if a car weighs 3000 pounds, we would estimate its mpg to be 37.3 - 5.34 * 3 = 21.28.
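You can check this arithmetic with R’s built-in predict() function (using the mtcars data we will meet in the lab below). Note the unrounded coefficients give a value close to, but not exactly, the hand calculation with the rounded coefficients:

```r
# Same model as in the lab below: mpg predicted from weight (in 1000s of lbs)
mod <- lm(mpg ~ wt, data = mtcars)
p <- predict(mod, newdata = data.frame(wt = 3))
p  # ~21.25 with unrounded coefficients; 21.28 with the rounded ones above
```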
You may be wondering how the linear regression model settled on that equation. Under the hood, the solution involves matrix algebra (in particular, a matrix inversion). We won’t go into these details here, but there is a closed-form analytic solution. Luckily, in practice, computers perform this calculation for us, as it is painful to do by hand.
Now let’s learn how to build our own linear regression model in R. We’ll walk through it step-by-step.
Applying Linear Regression: A Lab
For this lab we will use the stock data set mtcars in R. Load up the data and let’s check out the variables.
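mtcars ships with base R, so there is nothing to install. One way to load and inspect it:

```r
data(mtcars)  # attach the built-in data set
head(mtcars)  # peek at the first six rows
str(mtcars)   # variable names and types (mpg, cyl, wt, ...)
```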
As you can see, this data set contains information about cars, including their miles per gallon (mpg), weight (wt), number of cylinders (cyl), etc.
We will build a simple linear regression model that predicts mpg with wt.
mod <- lm(mpg ~ wt, data = mtcars)
Notice we have to specify three things:
- The outcome variable to be predicted (mpg; sometimes called the dependent variable)
- The predictor variable (wt; sometimes called the independent variable)
- The data set (in this case mtcars).
Now let’s check the results!
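The standard way to see these results is to call summary() on the model object:

```r
mod <- lm(mpg ~ wt, data = mtcars)
summary(mod)  # coefficient table, residual standard error, R-squared, etc.
```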
You can see the summary statistics from our model above.
The Estimate column gives us the estimates of the intercept and slope of our line of best fit. Recall from earlier that our line of best fit for predicting mpg from wt was given by 37.3 – 5.34*wt.
Regarding the intercept, it is mostly just there to anchor the line. It technically means ‘the estimated value of Y when X = 0,’ but in most regression contexts that doesn’t translate into anything of theoretical interest. Sometimes it flat out doesn’t even make sense. For example, a car weighing 0 pounds is impossible, so the intercept’s literal interpretation (the mpg of a weightless car) is meaningless here; it simply helps position the line.
The interpretation for the predictor slope coefficient is that for every 1-unit increase in car weight (in this context 1 unit =1000 pounds), that car’s gas mileage decreases by 5.34 mpg. It decreases because the coefficient is negative. If it were positive, the translation would change to “increases.”
The Standard Error column tells us how much precision we had in estimating the respective terms. Generally, wider scatter means a higher standard error. This makes sense visually. Compare the two models below:
The model in the left pane is the model we just fit. The model in the right pane is a model I just made up (which is just wt predicting 3*mpg). You can see there are larger residuals in this second model. This model therefore has less precision in its estimate of the line of best fit, and consequently higher standard error. Generally, you want the data points tightly hugging the line (like what can be seen in the left pane). When data points are further away from the line it indicates less precision.
In short, the standard error is a quantity representing the precision of our estimates. To save you scrolling, here is the same table of summary statistics as above:
The t-value gives us a test statistic for our intercept and slope estimates. It represents how many standard errors the obtained estimate falls from 0 under the null distribution. Though we have to sacrifice some detail here to keep this explanation intuitive, you can crudely think of the t value as a measurement of the size of the effect (in particular, the size of the effect in units of standard errors).
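You can verify this relationship directly from the coefficient table: each t value is just the estimate divided by its standard error:

```r
mod <- lm(mpg ~ wt, data = mtcars)
tab <- summary(mod)$coefficients
tab[, "Estimate"] / tab[, "Std. Error"]  # reproduces the "t value" column
```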
A more intuitive metric to understand (and one directly related to the t-statistic) is the p-value. The p-value basically tells us how ‘surprising’ our obtained results are, or how strong the evidence of a real effect is. We generally want small p-values. P-values below .05 indicate that an effect is statistically significant. Such a p-value can be translated as ‘there is less than a 5% chance of obtaining the results we did if there were truly no effect.’ In our output above, the p-value column indicates that wt is significantly related to mpg.
Notice, too, that both the Residual standard error and Multiple R-squared are metrics for judging how well our line fits the data. Generally, a lower residual standard error and a higher R-squared indicate a better fit to the data. R-squared tells you what proportion of the overall variance in the outcome is explained by the model. The mathematics are pretty simple. You can look them up in your spare time.
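If you want to see the arithmetic behind R-squared, it is one minus the ratio of the residual sum of squares to the total sum of squares. A quick check against the value R reports:

```r
mod <- lm(mpg ~ wt, data = mtcars)
sse <- sum(resid(mod)^2)                       # residual (unexplained) variation
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total variation in mpg
1 - sse / sst                                  # matches summary(mod)$r.squared (~0.75)
```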
That sums up the output in R. Another handy trick in RStudio is to type the $ symbol after your model object. It will pop up a list of the model’s components that you can explore further.
If you want to check out the coefficients, residuals, fitted values, etc., you can do that.
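For example, with the mod object fit earlier:

```r
mod <- lm(mpg ~ wt, data = mtcars)
mod$coefficients          # intercept and slope
head(mod$residuals)       # one residual per car
head(mod$fitted.values)   # predicted mpg for each car
```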
This can really come in handy if you want to use any of these values elsewhere in your analysis. For example, if you need to use the sum of squared residuals, you can do that by typing:
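For example, the sum of squared residuals falls out in one line:

```r
mod <- lm(mpg ~ wt, data = mtcars)
sum(mod$residuals^2)  # sum of squared residuals (SSE), ~278.3 for this model
```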
Congratulations! You are now officially educated in linear regression! You know what linear regression is, what it is used for, how it works, and also how to do it in R.