Linear regression is perhaps the most widely used statistical tool in data science. It is a fundamental building block for statistical inference and prediction, and many more advanced methods derive from simple linear regression. This blogpost will walk you through the basic concept of linear regression. It will also provide a code tutorial to perform a linear regression in R.
By the end of this blogpost you will know:
- What linear regression is
- How linear regression works
- How to perform linear regression in R and how to interpret its output
What is linear regression?
Simple linear regression is a statistical method that describes the relationship between two continuous quantitative variables. What makes it linear is that it uses a line to describe the relationship.
As you can see below, we are describing the relationship between car weight and gas mileage. And it appears that a downward sloping line describes this relationship well. Generally, as car weight increases, gas mileage decreases.
But notice that whereas the line does a good job of capturing the overall downward “trend” of this relationship, it is also clearly an imperfect model. As you can see, many points are above the line and many are below it. In fact, the line does not even come close to passing through most of the data points. This means there is also some “scatter” or “error” about the line.
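To see a scatterplot like the one described (the post's figures use the built-in mtcars data, as the lab below does), a minimal base-R sketch:

```r
# Scatterplot of gas mileage against car weight (built-in mtcars data)
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000s of pounds)",
     ylab = "Miles per gallon",
     main = "Gas mileage vs. car weight")
```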
The goal of linear regression is to fit a line to the data that minimizes the scatter.
How does linear regression work?
Linear regression works by finding a line of best fit to pass through the data. Infinitely many lines can be fit to the data, but the best fitting line (called the “line of best fit”) will be the one that minimizes scatter. The metric that is used to quantify scatter is called the sum of squared residuals (sometimes called sum of squared error or SSE for short).
Each data point has a corresponding “residual,” which is just a quantity representing the vertical distance between it and the line of best fit. In the plot below all the residuals are demarcated by black lines. You can see each residual represents how far ‘off’ the data point is from its estimated point on the line. In other words, the residual represents how much ‘error’ that data point contributes to the model.
A tacit goal of linear regression (or really any statistical model) is to minimize the overall model error. This should make intuitive sense. The less error there is in a model, the better it is fitting to the data. Therefore, we want to treat both positive residuals (those above the line) and negative residuals (those below the line) as equally ‘bad’. So we square each residual value and sum them up. The line of best fit is the line that minimizes the sum of these squared residuals.
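The idea above can be sketched in a few lines of R. This is an illustrative helper (the function name `sse` and the comparison line are my own, not from the original post), using the mtcars data from the lab below:

```r
# Sum of squared residuals for any candidate line: predicted = a + b * x
sse <- function(a, b, x, y) {
  predicted <- a + b * x
  residuals <- y - predicted   # vertical distance from each point to the line
  sum(residuals^2)             # squaring treats + and - residuals as equally 'bad'
}

# The fitted line (intercept ~37.3, slope ~-5.34) has a smaller SSE
# than an arbitrary alternative line:
sse(37.3, -5.34, mtcars$wt, mtcars$mpg)
sse(30, -3, mtcars$wt, mtcars$mpg)
```

Because the line of best fit minimizes the sum of squared residuals, any other line you try will produce a larger value.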
You can see the line of best fit above for estimating gas mileage (in mpg) is calculated to be: 37.3 - 5.34*car weight (in 1000s of pounds).
This equation tells us that any car's gas mileage (mpg) is predicted to be 37.3 minus 5.34 times its weight in thousands of pounds. So if a car weighs 3000 pounds, we would estimate its mpg to be 37.3 - 5.34*3 = 21.28.
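This prediction can also be done in R once a model is fit (the lab below builds this same model); a quick sketch:

```r
# Fit mpg from weight (weight is in 1000s of pounds)
mod <- lm(mpg ~ wt, data = mtcars)

# Predict mpg for a 3000-pound car (wt = 3)
predict(mod, newdata = data.frame(wt = 3))

# Equivalent by hand: intercept + slope * 3
coef(mod)[1] + coef(mod)[2] * 3
```

(The text's 21.28 uses the rounded coefficients 37.3 and 5.34; R's unrounded estimates give a very slightly different value.)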
You may be wondering how the linear regression model settled on that equation. Under the hood, the solution involves matrix algebra, specifically solving the so-called normal equations via matrix inversion. We won't go into those details here. Luckily, in practice modern computers take care of this for us.
Now let’s learn how to build our own linear regression model in R. We’ll walk through it step-by-step.
Applying Linear Regression: A Lab
For this lab we will use the stock data set mtcars in R. Load up the data and let’s check out the variables.
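One way to load and inspect it (mtcars ships with R, so the `data()` call is optional):

```r
data(mtcars)   # built-in data set; loading it explicitly is optional
head(mtcars)   # first few rows: mpg, cyl, disp, hp, ..., wt, ...
str(mtcars)    # variable names and types
```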
As you can see, this data set contains information about cars, including their miles per gallon (mpg), weight (wt), number of cylinders (cyl), etc.
We will build a simple linear regression model that predicts mpg with wt.
mod <- lm(mpg ~ wt, data = mtcars)
Notice we have to specify three things:
- The outcome variable to be predicted (mpg; sometimes called the dependent variable)
- The predictor variable (wt; sometimes called the independent variable)
- The data set (in this case mtcars).
Now let’s check the results!
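In base R this is done with summary(); a minimal sketch:

```r
mod <- lm(mpg ~ wt, data = mtcars)
summary(mod)   # prints Estimate, Std. Error, t value, Pr(>|t|),
               # residual standard error, and R-squared
```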
You can see the summary statistics from our model above.
The Estimate column gives us the values for our line of best fit equation. Recall from earlier that for predicting mpg from wt this was given by 37.3 – 5.34*wt. Importantly, the intercept (37.3) is the model's prediction when wt = 0; it is needed to anchor the line, but in most regression contexts it has no meaningful interpretation on its own. The correct interpretation of the predictor coefficient is that for every 1-unit increase in car weight (in this context, 1 unit = 1000 pounds), a car's predicted gas mileage decreases by 5.34 mpg. It "decreases" because the coefficient is negative; if it were positive, the interpretation would change to "increases."
The Standard Error column tells us how much precision we had in estimating the respective terms. Generally if there is wider scatter, there will be a higher standard error. This makes sense visually. See below two models:
The model in the left pane is the model we just fit. The model in the right pane is a model I just made up (which is just wt predicting 3*mpg). You can see there are larger residuals in this second model. This model therefore has less precision in its estimate of the line of best fit. Generally, you want the data points tightly hugging the line (such as what is illustrated by the model in the left pane). When data points are further away from the line it indicates less precision.
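One way to recreate something like the right-pane model is to rescale the outcome (I'm assuming "wt predicting 3*mpg" means fitting `3 * mpg` as the outcome, which triples every residual and hence the standard errors):

```r
mod1 <- lm(mpg ~ wt, data = mtcars)          # original model (left pane)
mod2 <- lm(I(3 * mpg) ~ wt, data = mtcars)   # outcome scaled by 3 (right pane)

# Residuals, and therefore standard errors, are 3x larger in the scaled model
summary(mod1)$coefficients[, "Std. Error"]
summary(mod2)$coefficients[, "Std. Error"]
```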
So standard error is a quantity representing the precision in our estimates. Here is the same table of summary statistics as above, so you don’t have to scroll.
The t value column gives us a test statistic for each of our estimates. It represents how many standard errors the estimate lies away from 0 under the null distribution. Though we have to sacrifice some detail here to keep this explanation intuitive, you can crudely think of the t value as a measure of the size of the effect, in units of standard errors. However, a more intuitive metric to understand (and one directly related to the t-statistic) is the p-value.
The p-value basically tells us how 'surprising' the obtained results would be if there were truly no effect, or in other words, how strong the evidence for an effect is. We generally want small p-values. By convention, p-values below .05 are taken to indicate that an effect is statistically significant, and a p-value of .01 constitutes even stronger evidence, and so on. In our output above, the p-value column indicates that wt is significantly and strongly related to mpg.
Notice too, both Residual standard error and Multiple R-squared are metrics for judging how well our line fits the data. Generally lower residual standard error and higher R-squared indicate a better fit to the data.
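Both of these fit metrics can be pulled out of the summary object directly; a quick sketch:

```r
mod <- lm(mpg ~ wt, data = mtcars)
s <- summary(mod)
s$sigma       # residual standard error
s$r.squared   # multiple R-squared
```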
That sums up the output in R. However, another handy trick in RStudio is to type the $ symbol right after your model object. It will pop up a list of components stored in your model that you can explore further.
If you want to check out the coefficients, residuals, fitted values, etc., you can do that.
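For example, a few of the components stored on the model object:

```r
mod <- lm(mpg ~ wt, data = mtcars)

mod$coefficients          # intercept and slope
head(mod$residuals)       # per-car residuals
head(mod$fitted.values)   # predicted mpg for each car
```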
This can really come in handy if you want to feed any of these values elsewhere in your analysis. Say, for example, you wanted the exact sum of squared residuals value. You can obtain it by typing:
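(The original snippet isn't shown; in base R, one way is:)

```r
mod <- lm(mpg ~ wt, data = mtcars)

sum(mod$residuals^2)   # sum of squared residuals
deviance(mod)          # equivalent shortcut for lm models
```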
Congratulations! You are now officially more educated in linear regression! You know what linear regression is, what it is used for, how it works, and also how to do it in R. I hope that you found this useful. I would be interested to hear about your experiences with linear regression, so please do leave some thoughts below!