library(ISLR)
Section 7 - Non-linear Models
Note: this section has several different types of models. We cannot cover all of them in 1.25 hours. We will go over a few examples but once you understand how the examples, you should be able to apply the logic to other types models.
This week, we will use the Wage data that is part of the ISLR package
Load the Wage data and drop the “Wage” columns, as we did in Section 6
<- Wage
wage_data <- wage_data[, -10] wage_data
To make life easier for later analyses, I will start by sorting the data on age. To do this, I will use a package called dplyr
, which has a lot of great tools for data manipulation.
library(dplyr)
The first two functions we will use are: - %>%
which means “and then do” - arrange()
which sorts the data on the column inside the parentheses
<- wage_data %>% arrange(age) wage_data
We will start by splitting the data into training and test sets
set.seed(222)
<- sample(1:nrow(wage_data), round(nrow(wage_data) * 0.8))
train <- sort(train)
train <- which(!(seq(nrow(wage_data)) %in% train)) test
To quickly and easily measure MSEP, write our own function
<- function(predictions, true_vals) {
msep_func <- mean((predictions - true_vals)^2)
MSEP return(MSEP)
}
Polynomial Regression
We can start by fitting a polynomial regression using only age
<- lm(wage ~ poly(age, 4), data = wage_data[train,]) age_poly
Extract the coefficients from the model
coef(summary(age_poly))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 111.88351 0.8086511 138.358203 0.000000e+00
poly(age, 4)1 394.18926 39.6156514 9.950342 6.942194e-23
poly(age, 4)2 -434.00566 39.6156514 -10.955409 2.748637e-27
poly(age, 4)3 105.56550 39.6156514 2.664742 7.756432e-03
poly(age, 4)4 -90.09776 39.6156514 -2.274297 2.303622e-02
When you use poly()
, it returns a matrix of “orthogonal polynomials” so the columns of the matrix are linear combinations of age, age\(^2\), age\(^3\), age\(^4\). Let’s take a look!
head(poly(wage_data$age, 4))
1 2 3 4
[1,] -0.0386248 0.05590873 -0.07174058 0.08672985
[2,] -0.0386248 0.05590873 -0.07174058 0.08672985
[3,] -0.0386248 0.05590873 -0.07174058 0.08672985
[4,] -0.0386248 0.05590873 -0.07174058 0.08672985
[5,] -0.0386248 0.05590873 -0.07174058 0.08672985
[6,] -0.0386248 0.05590873 -0.07174058 0.08672985
head(wage_data$age)
[1] 18 18 18 18 18 18
If you want it to return the raw powers of age, you can add an argument to the function poly()
head(poly(wage_data$age, 4, raw = TRUE))
1 2 3 4
[1,] 18 324 5832 104976
[2,] 18 324 5832 104976
[3,] 18 324 5832 104976
[4,] 18 324 5832 104976
[5,] 18 324 5832 104976
[6,] 18 324 5832 104976
Although the two forms give you different numbers, they result in the same predictions, because your model is still a linear combination of the original powers
<- lm(wage ~ poly(age, 4, raw = TRUE),
age_poly_TRUE data = wage_data[train,])
Extract the coefficients from the model
coef(summary(age_poly_TRUE))
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.085312e+02 6.559430e+01 -3.179106 0.0014961570
poly(age, 4, raw = TRUE)1 2.391179e+01 6.439356e+00 3.713383 0.0002091772
poly(age, 4, raw = TRUE)2 -6.646745e-01 2.257678e-01 -2.944063 0.0032705193
poly(age, 4, raw = TRUE)3 8.403840e-03 3.362741e-03 2.499105 0.0125172334
poly(age, 4, raw = TRUE)4 -4.098605e-05 1.802141e-05 -2.274297 0.0230362203