Tutorial 6: Data tables and Booleans

Objectives

  • calculate two-way data tables from data
  • assign matrix variables
  • perform the chi-squared test for independence
  • logical operators and tests in R
  • calculations using Boolean vectors

Data tables and the chi-squared test

matrices and data tables

R has many functions for different tests, including the chi-squared test for independence. To use it, one first has to input a data set in the form of a two-way table, where each row represents the values of one random variable, and each column represents the values of the second random variable. The following script shows how to manually input a 2 by 2 contingency table into a matrix. In the matrix function, ncol stands for number of columns, and nrow for number of rows. Notice the order in which the numbers are put into the matrix: down the first column, then the second, etc. Type help(matrix) for more details.

In order to access a specific element of the matrix, just like in vectors, R uses square brackets and two indices, first one for the row and second for the column. Below are examples of accessing two elements of the matrix data defined above, and how to reference a particular element of the matrix.

You can also access an entire row of a matrix by leaving the column index blank, or the entire column of a matrix by leaving the row index blank:

Chi-squared test

Based on a given data set, how likely is the hypothesis that the two random variables are independent? It is hard to do by hand (in the old days, you looked it up in a table of chi-squared values) but R will do it all for us: 1) calculate the expected counts, 2) compute the chi-squared value for the table, and 3) use the number of degrees of freedom and the chi-squared value to calculate the p-value of the independence hypothesis based on it. Use the chisq.test() function, and you will see output like this:

The results are the chi-squared values, the number of degrees of freedom (which depends on the number of rows and columns in the two-way table) and the p-value. The p-value is used to decide whether to reject the hypothesis, because it represents the likelihood of the hypothesis, given the data. In this case, the p-value is pretty small, so it seems relatively safe to reject the hypothesis of independence. To see the results of the hypothesis test, type print(test_output), and to access the p-value individually, use test_output$p.value.

Finally, we need to specify the significance level alpha for the hypothesis test. This refers to the probability of rejecting a true null hypothesis. For instance, if you reject the hypothesis at \(\alpha=0.05\) significance, you are setting a 5% chance of falsely rejecting a correct hypothesis, also called the rate of false positives. Note that it says nothing about failing to reject an incorrect hypothesis, also called the rate of false negatives. See the next section of the tutorial for more explanation of hypothesis testing errors.

generating a data table

Suppose that you have two groups of people: one with genotype A and the other with genotype B. The question is: does one of the genotypes make a phenotype (e.g. disease) more likely? In other words, are the variables of genotype and phenotype linked?

The script below generates fake data sets of people with genotype A and genotype B by simulating random sampling with a given probability. This is done using the function sample() to generate a vector of simulated people, who have the status N (normal) or D (disease), with specified probabilities:

Are the two genotypes linked to disease? We know the true answer, because we set the probabilities of disease to be different in the two genotypes, but can we tell from the data set? The following script uses table() to count the number of people with disease and normal in the vectors dis_genA and dis_genB to produce a two-way table data_mat and runs the chi-squared test for independence, then prints out all the results (stored in object chisq_result) and the p-value (in chisq_result$p.value)

You see that the chi-squared test returns a low p-value, indicating that it is very unlikely that the two random variables (genotype and disease) are independent, based on the data set.

This script plots the frequency of healthy and disease in the two groups so you can visually compare them:

Exercises

Based on the code in the chunks above perform the following tasks:

  1. Add a line to the script below that uses the function table() to print out the number of people with disease and normal in the genotype B group.

Use the example of how to use the function table from the code chunk above; remember to add print() around the table().

  1. In the script below add lines to print out the numbers of people with disease in genotype A sample and in genotype B sample using the matrix data_mat and indexing.

Use the correct row number and column number as indices; remember to add print() around the table().

  1. The script below runs the chisq.test() function, add a line to print out the p-value from the chi-squared test.

The p-value is a variable in the chisq_result object, see code above for example.

Logical values and calculations

logical tests

A logical test is an operation that returns either TRUE or FALSE (called Boolean values.) Here are several examples:

Please note that to ask the question β€œare x and y equal” the use of double equal signs is required, because a single equal sign means variable assignment.

Logical tests can be applied to entire vectors all at once. For example, let us assign a vector variable x and compare all of its values to 5:

The first six values of x are not greater than 5, while the last five are greater than 5, which is indicated by the Boolean output of the test.

calculations using Boolean vectors

Boolean values are distinct from other types of values, such as character strings or numbers. However, in R they can be used as numbers for computational purposes, with FALSE converting to 0 and TRUE converting to 1. This allows for convenient counting of the tests performed on large vectors. For instance, we can use the uniform number generator for integers between 0 and 20, to generate a vector of 10 values, and then ask how many of those values are less than 10 by using sum(), which will count the number of TRUE values:

logical operators: AND and OR

One can combine logical values in two common ways: using the AND operator and the OR operator. The AND operator asks whether both statements are true, while the OR operator asks if at least one statement is true. R uses the & symbol for AND to determine whether is true that x is greater than 1 AND than y is less than 5:

If we change the value of y so it is greater than five, than the logical & operator will be false:

The OR operator uses the symbol | and in the following example it is true because one of the logical statements is true:

Only if both statements are false, the logical OR is false:

Both & and | can be applied to vectors of Booleans, where they apply their logical operations to each corresponding element. For example, if x is a vector of integers from 0 to 10, one can ask how many of those values are both above 1 and below 5:

Exercises:

  1. Generate a sample of 100 values from the uniform distribution between 1 and 20 with replacement, assign it to vector sample_vec and use sum() to calculate how many of them are equal to 20 and print out that number.

Check the inputs into the function sample(); the second line should use the logical test ==

  1. Generate a sample of 20 values from the binomial distribution with p=0.4 and n=5, assign it to vector binom_vec and use sum() to calculate how many of them are greater than 1 and print out that number.

Check the inputs into the function rbinom(); the second line should use the logical test >

  1. Generate a sample of 20 values from the binomial distribution with p=0.4 and n=5 and calculate how many of them are BOTH above 1 AND below 4.

Check the inputs into the function rbinom(); the second line should use the conjunction & to combine two logical tests