Tutorial 6: Data tables and tests
Objectives
- calculate two-way data tables from data
- assign matrix variables
- perform the chi-squared test for independence
- perform the ANOVA test for equal means (optional)
Data tables and the chi-squared test
matrices and data tables
R has many functions for statistical tests, including the chi-squared test for independence. To use it, one first has to input a data set in the form of a two-way table, where each row represents the values of one random variable, and each column represents the values of the second random variable. The following script shows how to manually input a 2 by 2 contingency table into a matrix. In the matrix() function, ncol stands for the number of columns and nrow for the number of rows. Notice the order in which the numbers are put into the matrix: down the first column, then the second, etc. (this is the default behavior; setting byrow = TRUE fills the matrix row by row instead). Type help(matrix) for more details.
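A minimal sketch of such a script, using made-up counts (the actual numbers are placeholders for illustration):

```r
# Manually enter a hypothetical 2 by 2 contingency table of counts.
# matrix() fills the values down the first column, then the second.
data_mat <- matrix(c(10, 20, 30, 40), nrow = 2, ncol = 2)
print(data_mat)
```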
To access a specific element of a matrix, R uses square brackets with two indices, just as with vectors: the first index selects the row and the second selects the column. Below are examples of accessing individual elements of the matrix defined above.
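For example, with the same hypothetical matrix as above:

```r
data_mat <- matrix(c(10, 20, 30, 40), nrow = 2, ncol = 2)
print(data_mat[1, 2])  # element in row 1, column 2: 30
print(data_mat[2, 1])  # element in row 2, column 1: 20
```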
You can also access an entire row of a matrix by leaving the column index blank, or the entire column of a matrix by leaving the row index blank:
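Continuing with the same hypothetical matrix:

```r
data_mat <- matrix(c(10, 20, 30, 40), nrow = 2, ncol = 2)
print(data_mat[1, ])  # entire first row: 10 30
print(data_mat[, 2])  # entire second column: 30 40
```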
Chi-squared test
Based on a given data set, how plausible is the hypothesis that the two random variables are independent? This is hard to assess by hand (in the old days, you looked it up in a table of chi-squared values), but R will do it all for us: 1) calculate the expected counts, 2) compute the chi-squared statistic for the table, and 3) use the number of degrees of freedom and the chi-squared statistic to calculate the p-value for the independence hypothesis. Use the chisq.test() function, and you will see output like this:
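A sketch of the test on the hypothetical counts from above (note that for a 2 by 2 table, chisq.test() applies Yates' continuity correction by default):

```r
data_mat <- matrix(c(10, 20, 30, 40), nrow = 2, ncol = 2)
# run the chi-squared test for independence and store the result
test_output <- chisq.test(data_mat)
print(test_output)          # full test results
print(test_output$p.value)  # the p-value by itself
```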
The results are the chi-squared statistic, the number of degrees of freedom (which depends on the number of rows and columns in the two-way table), and the p-value. The p-value is used to decide whether to reject the hypothesis: it is the probability of observing data at least as extreme as the given data set, assuming the hypothesis of independence is true. In this case, the p-value is quite small, so it seems relatively safe to reject the hypothesis of independence. To see the results of the hypothesis test, type print(test_output), and to access the p-value individually, use test_output$p.value.
Finally, we need to specify the significance level alpha for the hypothesis test. This refers to the probability of rejecting a true null hypothesis. For instance, if you reject the hypothesis at \(\alpha=0.05\) significance, you are setting a 5% chance of falsely rejecting a correct hypothesis, also called the rate of false positives. Note that it says nothing about failing to reject an incorrect hypothesis, also called the rate of false negatives. See the next section of the tutorial for more explanation of hypothesis testing errors.
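As a quick illustration, the decision can be written as a comparison of the p-value against alpha (here using the conventional alpha = 0.05 and the hypothetical counts from above):

```r
alpha <- 0.05  # significance level: acceptable rate of false positives
data_mat <- matrix(c(10, 20, 30, 40), nrow = 2, ncol = 2)
test_output <- chisq.test(data_mat)
if (test_output$p.value < alpha) {
  print("Reject the hypothesis of independence")
} else {
  print("Fail to reject the hypothesis of independence")
}
```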
generating a data table
Suppose that you have two groups of people: one with genotype A and the other with genotype B. The question is: does one of the genotypes make a phenotype (e.g. disease) more likely? In other words, are the variables of genotype and phenotype linked?
The script below generates fake data sets of people with genotype A and genotype B by simulating random sampling with a given probability. This is done using the function sample() to generate a vector of simulated people, who have the status N (normal) or D (disease), with specified probabilities:
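A sketch of such a simulation; the sample size and the disease probabilities below are assumed values for illustration:

```r
set.seed(7)        # make the simulated samples reproducible
num_people <- 100  # assumed sample size for each genotype
# assumed disease probabilities: 0.2 for genotype A, 0.4 for genotype B
dis_genA <- sample(c("N", "D"), size = num_people, replace = TRUE,
                   prob = c(0.8, 0.2))
dis_genB <- sample(c("N", "D"), size = num_people, replace = TRUE,
                   prob = c(0.6, 0.4))
print(table(dis_genA))  # counts of D and N in the genotype A sample
```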
Are the two genotypes linked to disease? We know the true answer, because we set the probabilities of disease to be different in the two genotypes, but can we tell from the data set? The following script uses table() to count the numbers of disease and normal people in the vectors dis_genA and dis_genB to produce a two-way table data_mat, runs the chi-squared test for independence, then prints out all the results (stored in the object chisq_result) and the p-value (in chisq_result$p.value).
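A self-contained sketch of this script (regenerating the simulated samples with the same assumed probabilities as above):

```r
set.seed(7)
dis_genA <- sample(c("N", "D"), 100, replace = TRUE, prob = c(0.8, 0.2))
dis_genB <- sample(c("N", "D"), 100, replace = TRUE, prob = c(0.6, 0.4))
# stack the two sets of counts into a two-way table: one row per genotype
data_mat <- rbind(table(dis_genA), table(dis_genB))
rownames(data_mat) <- c("genA", "genB")
chisq_result <- chisq.test(data_mat)
print(chisq_result)          # all the test results
print(chisq_result$p.value)  # the p-value alone
```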
You see that the chi-squared test returns a low p-value, indicating that it is very unlikely that the two random variables (genotype and disease) are independent, based on the data set.
This script plots the frequency of healthy and disease in the two groups so you can visually compare them:
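One way to sketch such a plot, converting the counts to frequencies and drawing grouped bars (the simulated data is regenerated as above):

```r
set.seed(7)
dis_genA <- sample(c("N", "D"), 100, replace = TRUE, prob = c(0.8, 0.2))
dis_genB <- sample(c("N", "D"), 100, replace = TRUE, prob = c(0.6, 0.4))
data_mat <- rbind(table(dis_genA), table(dis_genB))
rownames(data_mat) <- c("genA", "genB")
freq_mat <- data_mat / rowSums(data_mat)  # convert counts to frequencies
# side-by-side bars for D and N within each genotype group
barplot(t(freq_mat), beside = TRUE, legend.text = TRUE,
        ylab = "frequency", main = "Disease status by genotype")
```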
Exercises
Based on the code in the chunks above perform the following tasks:
- Add a line to the script below that uses the function table() to print out the number of people with disease and normal in the genotype B group.
Use the example of the function table() from the code chunk above; remember to add print() around the table().
- In the script below, add lines to print out the numbers of people with disease in the genotype A sample and in the genotype B sample using the matrix data_mat and indexing.
Use the correct row number and column number as indices; remember to add print() around each indexed expression.
- The script below runs the chisq.test() function; add a line to print out the p-value from the chi-squared test.
The p-value is a variable in the chisq_result object; see the code above for an example.
Example of ANOVA (optional)
Analysis of Variance (ANOVA), described in Chapter 10, allows us to test for a relationship between a numeric response variable and a categorical explanatory variable. In other words, does the mean value of a numeric variable depend on the category of another variable? For example, does the body mass of a penguin in the Palmer Penguins data set depend on the species? The code below produces a boxplot that summarizes the distribution of body masses for different species:
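A sketch of such a boxplot, assuming the palmerpenguins package is installed (install.packages("palmerpenguins")), which provides the penguins data set:

```r
library(palmerpenguins)  # assumed to be installed; provides `penguins`
# boxplot of body mass grouped by species
boxplot(body_mass_g ~ species, data = penguins,
        ylab = "body mass (g)", main = "Penguin body mass by species")
```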
The mass for the Gentoo species appears to be larger than for the other two, but how sure are we that it's not a fluke, that is, that if we sampled again the difference would still be there? This is what we quantify with the ANOVA hypothesis test, which is performed using the function aov; notice that the syntax for the test is the same as for the function lm for linear regression:
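A sketch of the test, again assuming the palmerpenguins package is installed:

```r
library(palmerpenguins)  # assumed to be installed; provides `penguins`
# same formula syntax as lm(): response ~ explanatory
aov_result <- aov(body_mass_g ~ species, data = penguins)
print(summary(aov_result))  # mean squares, F value, and p-value
```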
The output contains several pieces of information, including the mean square difference between species (what I call VM in section 10.4) and the mean square residuals (VR). The ratio between the two is the value of F, which here is around 344. R computes the p-value for this F value from the F-distribution, and it is incredibly small (less than machine precision). Thus, we can reject the null hypothesis of no effect and say that the mass of the penguins really does depend on species.