Tutorial 3: Data frames and descriptive statistics

Learning goals

In this tutorial you will learn to:

Load data into data frames
Compute statistics for different variables
Visualize data sets using histograms
Visualize data sets using box plots

Working with data frames

R is first and foremost a programming language for working with data. To do this, we first need to load the data into the R environment so R can perform actions on it.

Data comes in many different formats, for example a formatted text file, an Excel spreadsheet, or a database that is queried with SQL. We will stick with simple formatted text files, which consist of lines that share a common format. Each line has a certain number of values (numeric, text, etc.) separated by a delimiter, which is a special character used only to divide values, for example commas in csv files (comma-separated values) and tabs in tsv files (tab-separated values).

Loading data from a file

The first way to load data is to read in a text file that is saved locally on your computer. In the script below, the function read.csv() loads the file called HR_data_combined.csv with the option header = TRUE which lets the function know that the first line of the file contains the names of the variables, while the rest contains the values. The information is assigned to a new object named heart_rates, which is a data frame.
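The loading step might look like the sketch below. So that it runs on its own, it first writes a tiny stand-in file to a temporary folder; the three rows of values are invented for illustration and are not the course data. With the real file saved next to your .Rmd file, only the commented-out line is needed.

```r
# Stand-in for HR_data_combined.csv: invented values, same column names
csv_path <- file.path(tempdir(), "HR_data_combined.csv")
writeLines(c("Rest1,Ex1,Rest2,Ex2,Year",
             "68,110,70,115,2018",
             "72,125,74,120,2018",
             "61,98,65,102,2019"),
           csv_path)

# header = TRUE: the first line of the file holds the variable names
heart_rates <- read.csv(csv_path, header = TRUE)

# With the real file in the working directory you would instead run:
# heart_rates <- read.csv("HR_data_combined.csv", header = TRUE)

str(heart_rates)  # a data frame with one column per variable
```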

data frame

A data frame is an object in the R environment that contains multiple variables with different names. These variables can be accessed by using the name of the data frame, followed by $ and then the variable name, e.g. df2$var1 refers to the variable named var1 in the data frame df2.

To see the names of the variables and the first few rows of values, you can use the function head(). You see that this data set contains five variables: four heart rate measurements reported by students in BIOS 20151 (in beats per minute) and the year of the course. The values for the variables are arranged in columns, and the first row of the file contains the names of the variables: Rest1, Ex1, Rest2, Ex2, and Year.
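Here is a small sketch of head() and the $ notation. The data frame is built by hand so the example runs without the data file; only the column names come from the tutorial, and the values are invented.

```r
# Invented values; only the column names match the course data set
heart_rates <- data.frame(
  Rest1 = c(68, 72, 61, 75, 80, 66, 71),
  Ex1   = c(110, 125, 98, 130, 121, 104, 117),
  Rest2 = c(70, 74, 65, 73, 82, 64, 69),
  Ex2   = c(115, 120, 102, 128, 119, 108, 112),
  Year  = c(2018, 2018, 2018, 2019, 2019, 2019, 2019)
)

head(heart_rates)       # variable names plus the first six rows
names(heart_rates)      # just the variable names
heart_rates$Rest1       # one variable, returned as a vector
heart_rates$Rest1[1:3]  # the first three values of that variable
```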

NOTE: The data file has to be saved into the same folder as the .Rmd file for this to work, or else a file path has to be specified to provide the address of the saved file.

Loading data from a package

R users create and share convenient collections of code or data called packages. You can also load data from a package, e.g. palmerpenguins, which contains a data set of observations on penguins recorded at the Palmer research station (see the explanation of the data set here: https://allisonhorst.github.io/palmerpenguins/).

To install this (or any other) package in RStudio, go to the Packages tab in the lower-right pane, click Install, and type palmerpenguins. We will use the data set called penguins, which contains 8 variables, as you can see below in the output of the head() function:
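The sketch below shows one way to load the package and look at the data. The requireNamespace() check is an addition for safety, so the script prints a message instead of stopping if the package has not been installed yet; install.packages() is the console alternative to the Packages tab.

```r
# One-time installation from the console (alternative to the Packages tab):
# install.packages("palmerpenguins")

if (requireNamespace("palmerpenguins", quietly = TRUE)) {
  library(palmerpenguins)   # makes the penguins data frame available
  print(head(penguins))     # variable names and the first six rows
  n_vars <- ncol(penguins)  # number of variables in the data frame
} else {
  message("Package palmerpenguins is not installed yet.")
  n_vars <- NA
}
```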

Descriptive statistics

Descriptive statistics are used to summarize a data set. The two key kinds of measures describe the center of the values and the spread of the values. The most common measures of center are the mean and the median, and the important measures of spread are the variance, the standard deviation, and the range of the data set.

One can calculate basic descriptive statistics as follows. Note the use of the function paste() to combine strings of text with numeric values, which makes the output easier to understand:
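A calculation of this kind could look like the sketch below. The two vectors are invented stand-ins for heart_rates$Rest1 and penguins$bill_length_mm so that the example runs on its own; the wording of the output strings is also just an illustration.

```r
# Invented stand-ins for heart_rates$Rest1 and penguins$bill_length_mm
rest1       <- c(68, 72, 75, 61, 80, 66, 71)
bill_length <- c(39.1, 39.5, 40.3, NA, 36.7)

# paste() joins text and numbers into a single string
print(paste("Mean resting heart rate:", mean(rest1)))  # many digits
print(paste("Mean bill length:", mean(bill_length)))   # NA: a value is missing
```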

There are two issues with the output of the above code. The first line correctly outputs the mean value, but it prints far more digits than needed, which makes the output messy. There are several ways to fix this. One of them is to limit the number of digits that R prints by calling options(digits = 5) and then using print() after the cat() function to print the correctly formatted output; the only drawback is the pesky [1] that print() adds to the output:
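A minimal sketch of the digits setting, using invented values. It prints the mean directly with print(), which respects options(digits = ...). The round() call at the end is one of the other possible fixes and is an addition, not something the tutorial prescribes.

```r
rest1 <- c(68, 72, 75, 61, 80, 66, 71)  # invented values

old_opts <- options(digits = 5)  # ask R to print about 5 significant digits
print(mean(rest1))               # shorter output, with the [1] prefix
options(old_opts)                # restore the previous setting

# Another possibility: round the value before pasting it into text
rounded_mean <- round(mean(rest1), 2)
print(paste("Mean resting heart rate:", rounded_mean))
```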

The second and larger issue is that the mean of the bill length returns NA, which means that some values are missing in that data vector (you could see this when we printed out the head of the data). Setting the option na.rm = TRUE in the function mean() tells it to ignore any missing values:
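The effect of na.rm can be seen in this short sketch, again with an invented vector standing in for penguins$bill_length_mm:

```r
bill_length <- c(39.1, 39.5, 40.3, NA, 36.7)  # invented values with one NA

mean(bill_length)                             # NA, because of the missing value
mean_bill <- mean(bill_length, na.rm = TRUE)  # NA is dropped before averaging
print(paste("Mean bill length:", mean_bill))
```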

Here are examples of median values:
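For instance, with two invented vectors in place of the heart rate and bill length variables:

```r
rest1       <- c(68, 72, 75, 61, 80, 66, 71)  # invented values
bill_length <- c(39.1, 39.5, 40.3, NA, 36.7)  # invented values with one NA

median_rest <- median(rest1)                      # middle value of the sorted data
median_bill <- median(bill_length, na.rm = TRUE)  # ignore the missing value
print(paste("Median resting heart rate:", median_rest))
print(paste("Median bill length:", median_bill))
```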

Here are examples of variances:
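A sketch with the same kind of invented vectors. Note that var() computes the sample variance, which divides by the number of values minus one:

```r
rest1       <- c(68, 72, 75, 61, 80, 66, 71)  # invented values
bill_length <- c(39.1, 39.5, 40.3, NA, 36.7)  # invented values with one NA

var_rest <- var(rest1)                      # sample variance (divides by n - 1)
var_bill <- var(bill_length, na.rm = TRUE)  # ignore the missing value
print(paste("Variance of resting heart rate:", round(var_rest, 2)))
print(paste("Variance of bill length:", round(var_bill, 2)))
```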

Here are examples of standard deviations (square root of variance):
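The following sketch, on invented values, also checks the relationship to the variance:

```r
rest1       <- c(68, 72, 75, 61, 80, 66, 71)  # invented values
bill_length <- c(39.1, 39.5, 40.3, NA, 36.7)  # invented values with one NA

sd_rest <- sd(rest1)                      # standard deviation
sd_bill <- sd(bill_length, na.rm = TRUE)  # ignore the missing value
print(paste("SD of resting heart rate:", round(sd_rest, 2)))
print(paste("SD of bill length:", round(sd_bill, 2)))

all.equal(sd_rest, sqrt(var(rest1)))  # TRUE: sd is the square root of var
```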

Here are examples of range (the minimum and maximum values of the data vector):
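A sketch on invented values. range() returns a vector of two numbers, the minimum first and the maximum second, so each can be picked out by its index:

```r
rest1       <- c(68, 72, 75, 61, 80, 66, 71)  # invented values
bill_length <- c(39.1, 39.5, 40.3, NA, 36.7)  # invented values with one NA

range_rest <- range(rest1)                      # c(minimum, maximum)
range_bill <- range(bill_length, na.rm = TRUE)  # ignore the missing value
print(range_rest)
print(paste("Smallest bill length:", range_bill[1]))
print(paste("Largest bill length:", range_bill[2]))
```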

What can go wrong

When reading in files, either from your computer's hard drive or from a URL, any mistake in the file name or the path (directory) will result in an error that looks like this:
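To reproduce this kind of error without stopping a script, the sketch below tries to read a deliberately misspelled, hypothetical file name and captures the error message with tryCatch(). The tryCatch() and file.exists() calls are additions for illustration; in everyday use you would simply see the error in the console.

```r
# A path that should not exist (hypothetical, misspelled file name)
bad_path <- file.path(tempdir(), "HR_data_combind.csv")

# tryCatch() captures the error message instead of stopping the script
msg <- tryCatch(
  suppressWarnings(read.csv(bad_path, header = TRUE)),
  error = function(e) conditionMessage(e)
)
print(msg)

file.exists(bad_path)  # FALSE: a quick way to check a path before reading
```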

Another common mistake is using a variable name without the data frame. For example, if you try to refer to the variable Rest1 without the data frame, R will not know what to do:
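This mistake can be reproduced with a small invented data frame. As before, tryCatch() is used only so the script keeps running after the error:

```r
heart_rates <- data.frame(Rest1 = c(68, 72, 61))  # invented values

# mean(Rest1) fails because Rest1 exists only as a column inside heart_rates
msg <- tryCatch(mean(Rest1), error = function(e) conditionMessage(e))
print(msg)

mean(heart_rates$Rest1)  # works: data frame name, then $, then variable name
```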

A different error is using a data frame as a variable. Since it contains multiple variables, we cannot calculate descriptive statistics on a whole data frame:
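A sketch with an invented data frame. In current versions of R, as far as we are aware, mean() applied to a whole data frame gives a warning and returns NA rather than a number; the warning is suppressed here only to keep the output short.

```r
heart_rates <- data.frame(Rest1 = c(68, 72, 61),
                          Ex1   = c(110, 125, 98))  # invented values

# A data frame is not a numeric vector, so mean() cannot summarize it
whole <- suppressWarnings(mean(heart_rates))
print(whole)  # NA

mean(heart_rates$Ex1)  # works on a single variable
```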

Exercises

The following R commands or scripts contain errors; your job is to fix them so they do what each exercise asks you to do. Try figuring out the errors on your own before clicking on the Hint box to expand it.

Calculate the mean of the second resting heart rate of the first 30 individuals (in the data frame heart_rates and variable Rest2):

Hint

Need to specify the data frame; use : to create a vector of indices from 1 to 30.

Calculate the range of penguin flipper lengths, assign it to a variable ran and print the maximum value:

Hint

Use option na.rm=TRUE to get rid of NAs; print the second element of the variable ran to show the maximum.

Calculate the ratios of all the bill lengths to the bill depths and print the mean and standard deviation of this ratio:

Hint

Cannot use a variable name without the data frame; it has to have the format df$Var. In mean() and sd(), use option na.rm=TRUE to get rid of NAs.

Use plot() to make a scatterplot of the first exercise heart rate as a function of the first resting heart rate from the data frame heart_rates:

Hint

Need to use the data frame name; switch the order of the variables in plot

Visualizing data sets

A picture is worth a lot of words, and a plot of data offers much more information than the basic descriptive statistics. A histogram offers a convenient visualization of a data set with a single variable.

histogram

A histogram is a plot of counts or frequencies of different values in a data vector, divided into bins. The x-axis typically shows the values of the variable, and the y-axis shows the counts, or frequencies, of the data in each bin.

R has a histogram function hist(), which does a passable job of representing the distribution of a variable such as flipper length or bill depth. The two histograms below provide visual descriptions of the two data sets:
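Two histograms of this kind can be sketched as follows. The vectors are invented stand-ins for penguins$flipper_length_mm and penguins$bill_depth_mm, and the titles and axis labels are our own choices. hist() leaves out missing values on its own.

```r
# Invented stand-ins for two penguin measurements (note the NAs)
flipper_length <- c(181, 186, 195, NA, 193, 190, 181, 195, 210, 217, 230, 198)
bill_depth     <- c(18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 14.5, 15.2, 16.1, 18.9)

# hist() draws the plot and also returns the bins and counts it used
h_flipper <- hist(flipper_length,
                  main = "Flipper length", xlab = "Flipper length (mm)")
h_depth   <- hist(bill_depth,
                  main = "Bill depth", xlab = "Bill depth (mm)")

h_flipper$counts  # number of values that fell into each bin
```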

Box plots

Sometimes a histogram is a bit too involved, especially when one wants a quick visual comparison of two variables. Here are example box plots for the first resting rate and the first exercise heart rate data sets:
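A sketch of two separate box plots, using invented vectors in place of heart_rates$Rest1 and heart_rates$Ex1; the titles and labels are our own choices.

```r
# Invented stand-ins for the first resting and exercise heart rates
rest1 <- c(68, 72, 75, 61, 80, 66, 71, 64, 77, 95)
ex1   <- c(110, 125, 130, 98, 121, 104, 117, 109, 135, 160)

# boxplot() draws the plot and returns the statistics it is based on
b_rest <- boxplot(rest1, main = "Resting heart rate", ylab = "Beats per minute")
b_ex   <- boxplot(ex1, main = "Exercise heart rate", ylab = "Beats per minute")

b_rest$stats  # lower whisker, first quartile, median, third quartile, upper whisker
```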

The boxes in these plots extend from the first quartile to the third quartile, with the line in the middle marking the median, so the middle half of the data set lies within the box. Each “whisker” extends from the box to the most extreme data point that lies within 1.5 times the length of the box (the interquartile range) from the box edge, so it stops at the minimum or the maximum when those are closer; the multiplier can be changed by setting the range option. Any points outside the whiskers are considered “outliers” and are shown as individual points.

The boxplot() function, like most R functions, has many options. They allow us to plot two or more box plots together, using the options at (the positions of the boxes on the x-axis) and names (the labels under the boxes), for direct visual comparison of two data sets:
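A sketch of a side-by-side comparison, with invented values and labels of our own choosing:

```r
# Invented stand-ins for the first resting and exercise heart rates
rest1 <- c(68, 72, 75, 61, 80, 66, 71, 64, 77, 95)
ex1   <- c(110, 125, 130, 98, 121, 104, 117, 109, 135, 160)

# at sets the x positions of the boxes, names sets the labels under them
b_both <- boxplot(rest1, ex1,
                  at = c(1, 2),
                  names = c("Rest 1", "Exercise 1"),
                  ylab = "Beats per minute")
```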

If you want to visualize the effect of a categorical variable on the distribution of a numeric variable, you can use formula notation (an expression of the form Y ~ X) as the first input of boxplot(), as you can see below:

The expression bill_length_mm ~ species tells boxplot() to plot the distributions of bill_length_mm as a function of species, whose values are shown on the x-axis. The option data specifies the data frame, so you don’t need to prefix the variable names with the data frame name.
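The formula notation can be tried out on a small hand-made data frame. The name penguins_demo and its values are invented for this sketch; the column names and species names follow the penguins data set, so the same boxplot() call should work with data = penguins once the package is loaded.

```r
# Invented miniature version of the penguins data frame
penguins_demo <- data.frame(
  species = c("Adelie", "Adelie", "Adelie",
              "Chinstrap", "Chinstrap", "Chinstrap",
              "Gentoo", "Gentoo", "Gentoo"),
  bill_length_mm = c(39.1, 39.5, 40.3, 46.5, 50.0, 51.3, 46.1, 50.0, 48.7)
)

# Y ~ X: numeric variable on the left, grouping variable on the right
b_species <- boxplot(bill_length_mm ~ species, data = penguins_demo,
                     xlab = "Species", ylab = "Bill length (mm)")
```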

Exercises

Plot a histogram of body masses of all the penguins in the penguins data frame

Hint

The data = option doesn’t work in the function hist(); you need to use the data frame and variable name, e.g. df$Var

Calculate the ratio of the penguin bill lengths to the bill depths and plot their histogram

Hint

Cannot use variable name without the data frame; has to have the format df$Var

Produce box plots of the second resting heart rates (variable Rest2) for different years:

Hint

Check the order of variables in the expression: it needs to have the format of Y ~ X