Tutorial 5: Linear regression

Learning goals

In this tutorial you will learn to:

  • Make scatterplots with linear regression lines
  • Interpret the output of linear regression
  • Examine the residuals of linear regression

Best-fit parameters

Linear regression

Linear regression is a method for finding the best-fit line to a two-variable data set. One of the variables is called the explanatory (independent) variable and the other is called the response (dependent) variable. The best-fit line is defined as the line whose graph passes the closest to the data points as measured by the sum of squared differences between the predicted and observed response variable values.
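In symbols, if the line predicts \(\hat{y}_i = a + b x_i\) for each observed pair \((x_i, y_i)\), the best-fit line is the one that minimizes the sum of squared residuals \(\sum_i (y_i - \hat{y}_i)^2\).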

Here is an example of a linear regression and the resulting line plotted over the scatterplot. The R function for linear regression, lm(), has two main inputs: a formula of the form Response ~ Explanatory and the name of the data frame (if the variables are part of one). The script below uses the data set of heart rates from students in the BIOS 20151 classes, generates a plot of the data, calls lm(), assigns the output to a variable fit_heart, and then plots the resulting best-fit line using the function abline():
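A minimal sketch of such a script, where the data frame heart_rates and its columns rest_hr and ex_hr are hypothetical stand-ins for the actual BIOS 20151 data set:

    # hypothetical stand-in for the BIOS 20151 heart rate data
    heart_rates <- data.frame(rest_hr = c(60, 65, 72, 80, 55, 68),
                              ex_hr   = c(92, 99, 110, 126, 85, 103))
    # scatterplot of the data
    plot(ex_hr ~ rest_hr, data = heart_rates,
         xlab = "resting heart rate (bpm)", ylab = "exercise heart rate (bpm)")
    # linear regression: Response ~ Explanatory
    fit_heart <- lm(ex_hr ~ rest_hr, data = heart_rates)
    # overlay the best-fit line on the scatterplot
    abline(fit_heart)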

The function lm() calculated the slope and intercept of the best-fit line. These values are bundled into the output of the function, which we assigned to fit_heart. This script prints them out:
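A sketch of such a script, continuing with the hypothetical fit_heart from above:

    # the slope and intercept are stored in the variable coefficients
    print(fit_heart$coefficients)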

As you can see, both parameters are part of the vector variable coefficients. If needed, they can be accessed separately by using the indexing notation:
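For example:

    print(fit_heart$coefficients[1])  # the intercept
    print(fit_heart$coefficients[2])  # the slope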

Interpreting the output of linear regression

We have seen how to obtain the slope and the intercept, but there’s a lot more in the linear regression output! If you run these scripts in RStudio you will see the object fit_heart in the Environment window (by default, top right). If you double-click on it, you’ll see that it’s a list object that contains many variables (like a data frame, except messier, because the variables don’t all have the same length). Let us use the data set from the palmerpenguins package that we saw in Tutorial 3 and run linear regression on body_mass_g as a function of bill_length_mm:
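A sketch of how this script might look (assuming the palmerpenguins package is installed):

    library(palmerpenguins)  # provides the penguins data frame
    # regression of body mass on bill length
    pen_fit <- lm(body_mass_g ~ bill_length_mm, data = penguins)
    summary(pen_fit)  # parameters, their uncertainty, and R-squared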

As you see, the function summary() gives us the key information, including the values of the parameters (plus some statistics indicating the uncertainty around those values) and the R-squared, also known as the coefficient of determination. While the slope and the intercept tell us which line provides the best fit, they tell us nothing about how good that fit is.

Coefficient of determination (R-squared)

The coefficient of determination or \(R^2\) of a linear regression is a measure of the goodness of fit of the line to the data. Numerically, it represents the fraction of variance of the response variable that is explained by the model.
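In symbols, with observed values \(y_i\), predicted values \(\hat{y}_i\), and observed mean \(\bar{y}\),

\[ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \]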

The value of R-squared is not accessible directly in pen_fit, but it is printed in the summary under the name Multiple R-squared, together with Adjusted R-squared, a slightly modified version. To print the coefficient of determination by itself, we have to take the output of summary() and access its variable r.squared:
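For example:

    # extract R-squared from the summary of the regression object
    print(summary(pen_fit)$r.squared)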

As above, we can plot the best-fit line over a scatterplot of the data by using abline() with pen_fit as an input:
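For example:

    # scatterplot of the penguin data
    plot(body_mass_g ~ bill_length_mm, data = penguins,
         xlab = "bill length (mm)", ylab = "body mass (g)")
    # overlay the best-fit line
    abline(pen_fit)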

Plotting the residuals

After performing linear regression it is essential to check that the residuals obey the assumptions of linear regression. The residuals are the differences between the observed values of the response variable, in this case the penguin body mass, and the values predicted by the line. The residuals are contained in the object pen_fit as a variable named residuals. Residuals are the leftover errors of the response variable, so one way to visualize them is to plot them against the explanatory variable. The values of the explanatory variable that were used for the linear regression are also stored in the object pen_fit, within the variable model, which contains the two variables passed to lm():
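For example:

    # explanatory values actually used by lm() are stored in pen_fit$model
    plot(pen_fit$model$bill_length_mm, pen_fit$residuals,
         xlab = "bill length (mm)", ylab = "residuals")
    abline(h = 0)  # horizontal reference line at zero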

NOTE: I used the values of bill length from the pen_fit object because several observations in the original data set had missing values (NA) and were discarded by lm(), so there are no corresponding residuals for those observations and the plot() function would have failed if I had used penguins$bill_length_mm as the x-variable instead.

By visual inspection, the residuals seem to satisfy the assumptions: they show no dependence on the explanatory variable (a shapeless scatterplot), are centered at zero, and look roughly normally distributed, although this can be checked more carefully using other tools.

For an example of these more careful tools, you can call plot() on the output of a linear regression, which will produce several diagnostic plots for checking that the residuals are normally distributed with equal variance:
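For example:

    par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2-by-2 grid
    plot(pen_fit)         # residuals vs fitted, Q-Q, scale-location, leverage
    par(mfrow = c(1, 1))  # restore the default plotting layout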

Exercises:

  1. Calculate the mean and standard deviation of the residuals from the previously calculated linear regression output pen_fit and use print() to report them:

Use functions mean() and sd(); specify the data frame as well as the variable, e.g. df$var; put these functions inside print().

  2. Calculate and print the fraction of variance explained by the linear regression by dividing the variance of the predicted values of the previously calculated linear regression object pen_fit (in the variable fitted.values) by the variance of the observed values (in the variable model$XXX, where XXX stands for the name of the response variable).

Calculate the ratio of variances inside the print() function; make sure that all the parentheses are matched correctly.

  3. Perform linear regression on the penguin data set with bill_depth_mm as the explanatory variable and body_mass_g as the response variable, and assign it to the variable fit_pen. Print out the coefficient of determination of this linear regression.

Need to first assign the output of the regression, e.g. fit_pen <- lm( ... ); follow the example above of printing out the R-squared.

  4. Plot the residuals from the linear regression as a function of the explanatory variable, as contained in the fit_pen object within the variable model$XXX, where XXX is the name of the explanatory variable.

Follow the example above of plotting residuals using the output of lm(), e.g. fit_pen$residuals.