```
dolly = c(9, 10, 11, 12, 13, 14, 15, 16, 17)
bees = c("b", "b", "b", "b", "b")
```

# Module 3: Vectors and Lists

This lesson is about understanding different data types in R, namely vectors and lists. We will build up to the understanding of what a dataframe really is at a high level. The lab will explore how vector operations are particularly useful in the context of data manipulation and analysis.

## Vectors and Lists

Download a copy of Module 3 slides

Download data for Module 3 lab and tutorial

## Lab 3

### General Guidelines:

You will encounter a few functions we did not cover in the lecture video. This will give you some practice using a new function for the first time. You can try the following steps:

- Start by typing `?new_function` in your Console to open up the help page.
- Read the help page of this new function. The description might be too technical for now. That’s OK. Pay attention to the Usage and Arguments sections, especially the argument `x`, or `x` and `y` (when two arguments are required).
- At the bottom of the help page, there are a few examples. Run the first few lines to see how it works.
- Apply it in your lab questions.
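The steps above can be sketched with a function you already know. The example below uses `sd()` purely as an illustration; `help()`, `args()`, and `example()` are base R helpers, and `sd_args` is a hypothetical variable name.

```r
# Open the help page; `?sd` is shorthand for help("sd").
help("sd")

# Print the argument list without leaving the console.
args(sd)

# Collect the argument names so they can be inspected as a vector.
sd_args <- names(formals(sd))
sd_args

# Run the examples listed at the bottom of the help page.
example("sd", ask = FALSE)
```

Swapping `sd` for the name of the unfamiliar function gives the same workflow.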

**It is highly likely that you will encounter error messages while doing this lab. Here are a few steps that might help get you through it.**

- Locate which line is causing this error first
- Check if you may have a typo in the code. Sometimes another person can spot a typo faster than you.
- If the code has no typos, try googling the error message
- Scroll through the top few links to see if any of them help
- Try working on the next few questions while waiting for answers by TAs

### Warm-up

In the lecture, we covered `c()`, `:`, `rep()`, and `seq()`, among other ways to create vectors.

- Recreate `dolly` using `:`.
- Create the same vector using `seq()`.
- Recreate `bees` using `rep()`.
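As a reminder of how each constructor behaves, here is a minimal sketch on made-up vectors. The names `primes`, `run`, `run_seq`, `evens`, and `zs` are illustrative and not part of the lab, so the questions above are still yours to answer.

```r
# c() combines individual values into a vector.
primes <- c(2, 3, 5, 7)

# `:` builds a run of consecutive integers.
run <- 3:7

# seq() builds a sequence from a start, an end, and a step size.
run_seq <- seq(3, 7, by = 1)
evens <- seq(2, 10, by = 2)

# rep() repeats a value a given number of times.
zs <- rep("z", 4)

# `:` and seq() can describe the same values.
all(run == run_seq)
```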

We are now going to use the functions `rnorm()` and `runif()` to initialize vectors.

```
random_norm = rnorm(100)
random_unif = runif(1000)
```

- How long are the vectors `random_norm` and `random_unif`? Use `length()` to verify.
- What are the largest and smallest values in `random_norm` and `random_unif`? Use `min()` and `max()`.
- Use `mean()` and `sd()` to calculate the mean and standard deviation of the two distributions.
- Create a new vector with 10000 draws from the standard normal distribution.
- `rnorm()` by default sets mean = 0 (see `?rnorm`). Create a vector of 10000 draws from the normal distribution with mean = 1. Use `mean()` to verify.

Notice that the functions `min()`, `max()`, `mean()`, and `sd()` all take a vector with many values and summarize them as one value. These are good to use with `summarize()` when doing data analysis on simple dataframes.
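To make the "many values in, one value out" idea concrete, here is a small sketch on a made-up vector. The names `v` and `summaries` are illustrative and not part of the lab data.

```r
# A small made-up vector (not lab data).
v <- c(4, 8, 15, 16, 23, 42)

length(v)  # number of elements in the input
min(v)     # smallest value
max(v)     # largest value
mean(v)    # arithmetic mean
sd(v)      # sample standard deviation

# Each summary is a single value, however long the input is.
length(mean(v))

# Collect the summaries into one named vector for easy viewing.
summaries <- c(min = min(v), max = max(v), mean = mean(v), sd = sd(v))
summaries
```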

#### Data Types

- Use `typeof()` to verify the data types of `dolly`, `bees`, and `random_unif`.
- Coerce `dolly` to a character vector. Recall we have functions `as.<type>()` for this kind of coercion.
- Try to coerce `bees` to type numeric. What does R do when you ask it to turn “b” into a number?
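The `as.<type>()` family can be sketched on made-up vectors as follows. The names `whole`, `as_text`, and `back` are illustrative; the example deliberately sticks to strings that do look like numbers, so the last question above stays open.

```r
# An integer vector built with `:`.
whole <- 1:3
typeof(whole)

# Coerce numbers to text.
as_text <- as.character(whole)
typeof(as_text)

# Coerce text that looks like numbers back to numeric.
back <- as.numeric(c("1", "2.5"))
typeof(back)

# is.<type>() functions test the type without changing it.
is.character(as_text)
```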

#### Vectorized Math

- `a` and `b` are vectors of length 10. Look at them in the console.

```
a <- 1:10
b <- rep(c(2, 4), 5)
```

- Add `a` and `b` element by element.
- Subtract `a` and `b` element by element.
- Divide `a` by `b` element by element.
- Multiply `a` and `b` element by element.
- Raise each element of `a` to the power of the corresponding element of `b`.
- Multiply each element of `a` by 3, then subtract `b`.
- Raise each element of `b` to the third power.
- Take the square root of each element of `a`.
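Element-by-element arithmetic can be sketched on two short made-up vectors before trying it on `a` and `b`. The names `x`, `y`, and the result variables are illustrative only.

```r
x <- c(1, 2, 3)
y <- c(10, 20, 30)

# Operators pair up elements by position: 1 with 10, 2 with 20, 3 with 30.
total <- x + y
ratio <- y / x

# A single number is reused for every element of the vector.
doubled <- x * 2
squared <- x^2

# Math functions are applied to each element as well.
roots <- sqrt(c(4, 9, 16))

total
ratio
doubled
squared
roots
```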

#### Calculating Mean and Standard Deviation

##### Calculating the Mean

In this exercise, we will calculate the mean of a vector of random numbers. We will practice assigning new variables and using functions in R.

We can run the following code to create a vector of 1000 random numbers. The function `set.seed()` ensures that the process used to generate random numbers is the same across computers.

**Note**: `rf()` is an R command we use to generate 1000 random numbers according to the F distribution, and `10` and `100` are its two degrees-of-freedom parameters (`df1` and `df2`), which control the shape of the distribution, including how “peaked” it is.

```
set.seed(1)
random_numbers = rf(1000, 10, 100)
```
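The effect of `set.seed()` can be checked directly. This is a small sketch with an arbitrary seed of 42 and hypothetical variable names; it is separate from the lab's `random_numbers`.

```r
set.seed(42)
first_draw <- runif(3)

# Resetting to the same seed replays the same "random" numbers.
set.seed(42)
second_draw <- runif(3)

# Without a reset, the generator has moved on and gives new numbers.
third_draw <- runif(3)

identical(first_draw, second_draw)
identical(first_draw, third_draw)
```

If you run this sketch in the same session as the lab, re-run the `set.seed(1)` block above afterwards so that `random_numbers` matches the expected answers.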

Write code that gives you the sum of `random_numbers` and saves it to a new variable called `numbers_sum`.

**Hint:** To sum the numbers in a vector, use the `sum()` function.

**Note**: You don’t automatically see the output of `numbers_sum` when you assign it to a variable. Type `numbers_sum` into the console and run it to see the value that you assigned it.

Write code that gives you the number of items in the `random_numbers` vector and saves it to a new variable called `numbers_count`.

**Hint:** To count the number of items in a vector, use the `length()` function.

Now write code that uses the above two variables to calculate the average of `random_numbers` and assign it to a new variable called `this_mean`.

What number did you get? It should have been 1.018. If it isn’t, double check your code!

R actually has a built-in function to calculate the mean for you, so you don’t have to remember how to build it from scratch each time! Check your above answer by using the `mean()` function on the `random_numbers` vector.
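The same cross-check can be sketched on a tiny made-up vector where the arithmetic is easy to do in your head. The names `scores`, `scores_sum`, `scores_count`, and `by_hand` are illustrative only.

```r
scores <- c(70, 80, 90, 100)

scores_sum <- sum(scores)       # 70 + 80 + 90 + 100
scores_count <- length(scores)  # how many values there are

# The mean is the total divided by the count.
by_hand <- scores_sum / scores_count

# The built-in function should agree with the by-hand result.
by_hand == mean(scores)
```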

##### Calculating the Standard Deviation

Now that you’ve got that under your fingers, let’s move on to standard deviation.

We will be converting the following formula for calculating the sample standard deviation into code:

\(s = \sqrt{\frac{\sum_{i=1}^n (x_i - \overline{x})^2} {n-1}}\)

For this, we’ll review the concept of *vectorization*. This means that an operation like subtraction will act on all numbers in a vector at the same time.
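As a small sketch of vectorization, the numerator of the formula can be built step by step on a made-up vector. The names `x`, `x_bar`, `deviations`, and `squared` are illustrative; assembling the full formula is left to the exercise.

```r
x <- c(2, 4, 6, 8)
x_bar <- mean(x)

# Subtracting one number from a vector subtracts it from every element.
deviations <- x - x_bar

# Squaring a vector squares every element.
squared <- deviations^2

# sum() then collapses the squared deviations into a single value.
sum(squared)
```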

Subtract `this_mean` from the `random_numbers` vector. Did each number in `random_numbers` change?

Try to write the formula for standard deviation in R code using the `sqrt()`, `sum()`, and `length()` functions, along with other operators (`^`, `/`, `-`). Assign it to a new variable called `this_sd`. Watch out for your parentheses!

What number did you get for `this_sd`, or the standard deviation of `random_numbers`? If you didn’t get 0.489704, recheck your code!

R also has a built-in function for standard deviation. Check if you calculated the standard deviation correctly by using the `sd()` function on the `random_numbers` vector.

#### Making a Histogram of Our Numbers

What do these random numbers look like, anyway? We can use base plotting in R to visualize the distribution of our random numbers.

Run the following code to visualize the original distribution of `random_numbers` as a histogram.

`hist(random_numbers)`

Notice how most of the values are concentrated on the left-hand side of the graph, while there is a longer “tail” to the right? Counterintuitively, this is known as a right-skewed distribution. When we see a distribution like this, one common thing to do is to normalize it.

This is also known as *calculating a z-score*, which we will cover next.

#### Calculating a Z-Score

The formula for calculating a z-score for a single value, or *normalizing* that value, is as follows:

\(z = \frac{x - \bar{x}}{s}\)

This can be calculated for each value in `random_numbers` in the context of the larger set of values.

Can you translate this formula into code?

Using `random_numbers`, `this_mean`, and `this_sd` that are already in your environment, write a formula to transform all the values in `random_numbers` into z-scores, and assign it to the new variable `normalized_data`.

**Hint:** R is vectorized, so you can subtract the mean from each random number in `random_numbers` in a straightforward way.

Take the mean of `normalized_data` and assign it to a variable called `normalized_mean`.

**Note**: If you see something that ends in “e-16”, that means that it’s a very small decimal number (16 places to the right of the decimal point), and is essentially 0.
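Tiny leftovers like this come from how computers store decimal numbers. The sketch below reproduces one with simple arithmetic; `tiny` is a hypothetical variable name, and `all.equal()` is the base R function for comparing numbers up to a small tolerance.

```r
# Mathematically this is exactly 0, but the stored result is not.
tiny <- 0.1 + 0.2 - 0.3

# Printed in scientific notation: a number times a negative power of 10.
tiny

# An exact comparison with 0 fails...
tiny == 0

# ...but a comparison that allows a small tolerance succeeds.
isTRUE(all.equal(tiny, 0))
```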

Take the standard deviation of `normalized_data` and assign it to a variable called `normalized_sd`.

What is the value of `normalized_mean`? What is the value of `normalized_sd`? You should get a vector that has a mean of zero and a standard deviation of one, because the data has been normalized.
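As with `mean()` and `sd()`, base R has a helper that can serve as a cross-check here: `scale()` centers a vector on its mean and divides by its standard deviation by default. The sketch below uses a made-up vector, and `x` and `z` are illustrative names.

```r
x <- c(3, 7, 8, 12, 20)

# scale() returns a one-column matrix, so flatten it back to a plain vector.
z <- as.numeric(scale(x))

# The mean should be essentially 0 and the standard deviation 1.
mean(z)
sd(z)
```

Comparing your own `normalized_data` against `as.numeric(scale(random_numbers))` with `all.equal()` is one way to confirm the by-hand version.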

##### Making a Histogram of Z-scores

Let’s plot the z-scores and see if our values are still skewed. How does this compare to the histogram of `random_numbers`? Run the following code:

`hist(normalized_data)`

Is the resulting data skewed?

### Calculating a T-Score

T-tests are used to assess whether the means of two samples differ by more than we would expect from chance alone. The formula for calculating a t-score is as follows:

\(t = \frac{\overline{x}_1 - \overline{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\)

where \(\overline{x}_i\) is the mean of the first or second set of data, \(s_i\) is the sample standard deviation of the first or second set of data, and \(n_i\) is the sample size of the \(i\)th set of data.

We’ll first create two data sets of random numbers following a normal distribution:

```
set.seed(1)
data_1 <- rnorm(1000, 3)
data_2 <- rnorm(100, 2)
```

Here’s how we’ll calculate the mean (`x_1`), standard deviation (`s_1`), and sample size (`n_1`) of the first data set:

```
x_1 <- mean(data_1)
s_1 <- sd(data_1)
n_1 <- length(data_1)
```

What numeric types do you get from doing this? Try running the `typeof()` function on each of `x_1`, `s_1`, and `n_1`. We have started you off with `x_1`.

`typeof(x_1)`

`[1] "double"`

What object type is `n_1`?
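R has two numeric types, and `typeof()` tells them apart. The sketch below uses literal values rather than the lab variables, so the question above is still yours to check; the variable names are illustrative.

```r
# Plain numbers are stored as doubles by default.
plain <- typeof(2)

# The L suffix asks for an integer instead.
with_suffix <- typeof(2L)

# `:` produces integers, while c() with plain numbers produces doubles.
from_colon <- typeof(1:3)
from_c <- typeof(c(1, 2, 3))

# Division returns a double even when both inputs are integers.
from_division <- typeof(2L / 1L)

c(plain, with_suffix, from_colon, from_c, from_division)

# Both types count as numeric.
is.numeric(2L)
```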

Can you calculate the same values for `data_2`, assigning mean, standard deviation, and length to the variables of `x_2`, `s_2`, and `n_2`, respectively?

What values do you get for `x_2` and `s_2`?

Now, you should be able to translate the t-score formula (\(\frac{\overline{x}_1 - \overline{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\)) into code, based on the above calculated values.

What did you get for the t-score? You should have gotten 9.243; if not, double-check your code!

The t-score’s meaning depends on your sample size, but in general t-scores close to 0 imply that the means are not statistically distinguishable, and t-scores that are large in absolute value (e.g. |t| > 3) imply the data have different means.

#### Performing a T-Test

Once again, R has a built-in function that will perform a T-test for us, aptly named `t.test()`. Look up the arguments the function `t.test()` takes, and perform a T-test on `data_1` and `data_2`.

What are the sample means, and are they distinguishable from each other?
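If the printed output of `t.test()` is hard to read, the pieces can be pulled out by name. The sketch below uses made-up data with an arbitrary seed; `group_a`, `group_b`, and `result` are illustrative names. By default `t.test()` runs Welch's two-sample test, which does not assume equal variances and matches the t-score formula above.

```r
set.seed(123)
group_a <- rnorm(50, mean = 5)
group_b <- rnorm(40, mean = 5.5)

# With two vectors, t.test() compares their means.
result <- t.test(group_a, group_b)

result$method     # which version of the test was run
result$statistic  # the t-score
result$estimate   # the two sample means
result$p.value    # small values suggest the means differ
```

Re-run the `set.seed(1)` data block before answering the lab questions if you try this in the same session, since the sketch changes the random seed.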

Well done! You’ve learned how to work with R to calculate basic statistics. We’ve had you generate a few by hand, but be sure to use the built-in functions in R in the future.
