Jacob Jameson
Fall 2021
We use for-loops to repeat a task over many different inputs or to repeat a simulation process several times.
for(value in c(1, 2, 3, 4, 5)) {
print(value)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
for (x in c(3, 6, 9)) {
print(x)
}
[1] 3
[1] 6
[1] 9
for (x in c(3, 6, 9)) {
print(x)
}
Our for-loop is equivalent to the following code. For each value in c(3,6,9)
, we assign the value to x
and the do the action between the curly brackets in order.
x <- 3
print(x)
x <- 6
print(x)
x <- 9
print(x)
The general structure of a for loop is as follows:
for (value in list_of_values) {
do something (based on value)
}
for (index in list_of_indices) {
do something (based on index)
}
Suppose we want to find the means of increasingly large samples.
mean1 <- mean(rnorm(5))
mean2 <- mean(rnorm(10))
mean3 <- mean(rnorm(15))
mean4 <- mean(rnorm(20))
mean5 <- mean(rnorm(25000))
means <- c(mean1, mean2, mean3, mean4, mean5)
means
[1] 0.711000420 -0.061282358 0.028560286 -0.153453712 -0.006932076
Let’s avoid repeating code with a for loop.
sample_sizes <- c(5, 10, 15, 20, 25000)
sample_means <- rep(0, length(sample_sizes))
for (i in seq_along(sample_sizes)) {
sample_means[[i]] <- mean(rnorm(sample_sizes[[i]]))
}
sample_means
[1] -0.115037250 -0.161057505 -0.008466999 -0.143818123 0.008976676
In the following slides we’ll explain each step.
Assign initial variables before starting the for loop.
# determine what to loop over
sample_sizes <- c(5, 10, 15, 20, 25000)
# pre-allocate space to store output
sample_means <- rep(0, length(sample_sizes))
To start: 1. create a vector of the sample_sizes we want to use 2. create a vector to store the output
sample_means <- rep(0, length(sample_sizes))
sample_means
[1] 0 0 0 0 0
Why do this? It makes the code more efficient. An alternative is to build up an object as you go, but this requires copying the data over and over again.
sample_means <- vector("double", length = 5)
sample_means <- double(5)
Each data type has a comparable function e.g. logical()
, integer()
, character()
.
To hold data of different types, we’ll use lists.
data_list <- vector("list", length = 5)
Determine what sequence to loop over.
for (i in 1:length(sample_sizes)) {
}
seq_along(x)
is synonymous to 1:length(x)
where x
is a vector.
Example:
vec <- c("x", "y", "z")
1:length(vec)
[1] 1 2 3
seq_along(vec)
[1] 1 2 3
sample_sizes <- c(5, 10, 15, 20, 25000)
seq_along(sample_sizes)
[1] 1 2 3 4 5
(What if sample_sizes is accidentally a 0-length vector? See what happens in R for Data Science.)
sample_sizes <- c(5, 10, 15, 20, 25000)
sample_means <- rep(0, length(sample_sizes))
for (i in seq_along(sample_sizes)) {
}
Use seq_along()
to be safe!
sample_sizes <- c(5, 10, 15, 20, 25000)
sample_means <- numeric(length(sample_sizes))
for (i in seq_along(sample_sizes)) {
sample_means[[i]] <- mean(rnorm(sample_sizes[[i]]))
}
sample_means
[1] 0.358923545 -0.134990944 -0.087576772 0.057910107 0.006155888
Save the mean of the sample to the ith place of the sample_means
vector.
This code falls, because we do not store the output in sample_means
in the for loop! (Compare to previous slide).
sample_sizes <- c(5, 10, 15, 20, 25000)
sample_means <- rep(0, length(sample_sizes))
for (i in seq_along(sample_sizes)) {
mean(rnorm(sample_sizes[[i]]))
}
sample_means
[1] 0 0 0 0 0
Right now, we’re calculating the mean, but it’s not being saved anywhere.
You get data stored in split over several csv files. We can read the data into R and store it store it as a single data set.
setwd("~/downloads/Module 8/loops")
file_1 <- read_csv("data_1999.csv")
file_2 <- read_csv("data_2000.csv") ...
file_22 <- read_csv("data_2020.csv")
data <- bind_rows(file_1, file_2, ..., file_22)
The data used for this exercise is fake data which I made with a for-loop. Run the code below (choose your own working directory) to follow along.
setwd("~/downloads/Module 8/loops")
file_list <- paste0("data_", 1999:2020, ".csv")
for (file in file_list) {
data <-tibble(id = 1:100,
employed = sample(c(0, 1, 1, 1),
100, replace = TRUE),
happy = sample(c(0,1),
100, replace = TRUE))
write_csv(data, file)
}
bind_rows() stacks two dataframe, or combines two vectors into a dataframe:
df_1 <- tibble(col1 = 1, col2 = "A")
df_2 <- tibble(col1 = 2:3, col2 = c("B", "C"))
bind_rows(df_1, df_2)
# A tibble: 3 x 2
col1 col2
<dbl> <chr>
1 1 A
2 2 B
3 3 C
?list.files():
These functions produce a character vector of the names of files … in the named directory.
.csv$
matches any
string and .csv ensures the string ends in csv.list.files("~/downloads/Module 8/loops", pattern = "*.csv$")
[1] "data_1999.csv" "data_2000.csv" "data_2001.csv" "data_2002.csv"
[5] "data_2003.csv" "data_2004.csv" "data_2005.csv" "data_2006.csv"
[9] "data_2007.csv" "data_2008.csv" "data_2009.csv" "data_2010.csv"
[13] "data_2011.csv" "data_2012.csv" "data_2013.csv" "data_2014.csv"
[17] "data_2015.csv" "data_2016.csv" "data_2017.csv" "data_2018.csv"
[21] "data_2019.csv" "data_2020.csv"
file_names <- list.files(,pattern = "*.csv$")
output <- vector("list", length(file_names))
for (i in seq_along(file_names)) {
output[[i]] <- read_csv(file_names[[i]]) %>%
mutate(year = str_extract(file_names[[i]], "[0-9]{4}"))
}
data <- bind_rows(output)
View(data)
setwd("~/downloads/Module 8/loops")
# by default, reads files in working directory
file_list <- list.files(pattern = "*.csv$")
out <- tibble()
for (file in file_list) {
temp <- read_csv(file)
out <- bind_rows(out, temp)
}
nrow(out)
[1] 2200
When possible, take advantage of the fact that R is vectorized.
a <- 7:11
b <- 8:12
out <- rep(0L, 5)
for (i in seq_along(a)) {
out[[i]] <- a[[i]] + b[[i]]
}
out
[1] 15 17 19 21 23
This is a bad example of a for loop!
a <- 7:11
b <- 8:12
out <- a + b
out
[1] 15 17 19 21 23
Use vectorized operations and tidyverse functions like mutate() when you can.
print()
)Further study: Many R coders prefer the map()
family functions from purrr
or base R apply family. See iteration in R for Data Science