Jacob Jameson
Summer 2021
dplyr is based on the concepts of functions as verbs that manipulate data frames. It is part of tidyverse.
Single data frame functions / verbs:
Our data comes from the National Health and Nutrition Survey.
library(tidyverse)
library(haven)
nhanes <- read_dta('nhanes_clean.dta')
names(nhanes)
[1] "id" "age" "sex" "race"
[5] "height" "weight" "sbp" "dbp"
[9] "cholesterol" "triglycerides" "hdl" "healthstatus"
[13] "heartattack" "diabetes" "finalwgt" "strata"
[17] "psu" "today" "dob" "bmi"
[21] "HighBP"
Pause this video and take some time to further explore this data!
You have see examples of filter()
that return rows based on a single condition, but the filter option also allows AND and OR style filters:
filter(condition1, condition2)
will return rows where both conditions are met.filter(condition1, !condition2)
will return all rows where condition one is true but condition 2 is not.filter(condition1 | condition2)
will return rows where condition 1 and/or condition 2 is met.filter(xor(condition1, condition2)
will return all rows where only one of the conditions is met, and not when both conditions are met.Multiple AND, OR and NOT conditions can be combined. The sample code will return all age 50 respondents that either have a weight above 113kg (250lbs) or a bmi (body mass index) above 30.
nhanes %>%
filter(age == 50, (weight > 113 | bmi > 30))
There is another option which avoids the continuous retyping of columns names: one_of(). You can set up column names upfront, and then refer to them inside a select() statement by either wrapping them inside one_of() or by using the !! operator. This will also allow you to re-order columns.
blood_related <- c("sbp", "dbp",
"cholesterol", "hdl",
"triglycerides")
nhanes %>%
select(!!blood_related)
nhanes %>%
select(one_of(blood_related))
The select_if
function allows you to pass functions which return logical statements. For instance you can select all string columns by using select_if(is.character)
. Similarly, you can add is.numeric
, is.integer
, is.double
, is.logical
, is.factor
.
nhanes %>%
select_if(is.character) %>%
glimpse
Rows: 1,267
Columns: 1
$ id <chr> "NC0001", "NC0002", "NC0003", "NC0004", "NC0005", "NC0006", "NC0007…
You can also select the negation but in this case you will need to add a tilde ~ to ensure that you still pass a formula to select_if
. The select_all/if/at functions require that a function or a formula is passed as an argument. If you have to add any negation or arguments, you will have to wrap your function inside funs() or add a tilde ~ before to remake it a function.
nhanes %>%
select_if(~!is.numeric(.)) %>%
glimpse
Rows: 1,267
Columns: 2
$ id <chr> "NC0001", "NC0002", "NC0003", "NC0004", "NC0005", "NC0006", "NC000…
$ dob <date> NA, 1943-08-24, 1954-05-08, 1950-06-25, 1987-06-11, 1956-10-30, 1…
nhanes %>%
select_if(funs(!is.numeric(.))) %>%
glimpse
Rows: 1,267
Columns: 2
$ id <chr> "NC0001", "NC0002", "NC0003", "NC0004", "NC0005", "NC0006", "NC000…
$ dob <date> NA, 1943-08-24, 1954-05-08, 1950-06-25, 1987-06-11, 1956-10-30, 1…
Similarly mean > 4 is not a function in itself, so you will need to add a tilde upfront, or wrap it inside funs() to turn the statement into a function.
nhanes %>%
select_if(is.numeric) %>%
select_if(~mean(., na.rm=TRUE)>4 ) %>%
glimpse()
or shorter:
nhanes %>%
select_if(~is.numeric(.) & mean(., na.rm=TRUE) > 4) %>%
glimpse()
If you will be using a select()
statement, you can rename straight in the select function. Pay attention to which comes first
nhanes %>%
select(systolic_blood_pressure = sbp,
diastolic_blood_pressure = dbp) %>%
glimpse
If you want to retain all columns and therefore have no select()
statement, you can rename by adding a rename()
statement.
nhanes %>%
rename(systolic_blood_pressure = sbp,
diastolic_blood_pressure = dbp) %>%
glimpse
The select_all() function allows changes to all columns, and takes a function as an argument.
nhanes %>%
select_all(toupper) %>%
glimpse
nhanes %>%
select_all(tolower) %>%
glimpse
nhanes %>%
mutate(weight_lbs = weight*2.2046) %>%
glimpse
New columns can be made with aggregate functions such as average, median, max, min, sd, … The sample code makes two new columns: one showing the difference of each row versus the average sbp number, and one showing the difference versus the lowest number for sbp.
nhanes %>%
select(id, sbp) %>%
mutate(
sbp_vs_AVG = sbp - round(mean(sbp), 1),
sbp_vs_MIN = sbp - min(sbp)) %>%
glimpse
The ifelse()
function deserves a special mention because it is particularly useful if you don’t want to mutate the whole column in the same way. With ifelse()
, you first specify a logical statement, afterwards what needs to happen if the statement returns TRUE, and lastly what needs to happen if it’s FALSE.
nhanes %>%
select(id, sbp) %>%
mutate(hypertension = ifelse(sbp > 130, 'Yes', 'No')) %>%
glimpse
nhanes %>%
select(id, sbp) %>%
mutate(status = case_when(
sbp > 180 ~ "Hypertensive Crisis",
sbp > 140 ~ "Hypertension Stage 2",
sbp > 130 ~ "Hypertension Stage 1",
sbp > 120 ~ "Elevated",
TRUE ~ "Normal"))
Calling dplyr verbs always outputs a new data frame, it does not alter the existing data frame
So to keep the changes, we have to reassign the data frame to be the output of the pipe!
EX:
nhanes <- nhanes %>%
mutate(weight_lbs = weight*2.2046)
or
nhanes2 <- nhanes %>%
mutate(weight_lbs = weight*2.2046)
write.csv(nhanes2, "nhanes_clean.csv")