`library(ISLR)`

# Section 7 - Non-linear Models

Note: this section has several different types of models. We cannot cover all of them in 1.25 hours. We will go over a few examples but once you understand how the examples, you should be able to apply the logic to other types models.

This week, we will use the Wage data that is part of the ISLR package

Load the Wage data and drop the “Wage” columns, as we did in Section 6

```
<- Wage
wage_data <- wage_data[, -10] wage_data
```

To make life easier for later analyses, I will start by sorting the data on age. To do this, I will use a package called `dplyr`

, which has a lot of great tools for data manipulation.

`library(dplyr)`

The first two functions we will use are: - `%>%`

which means “and then do” - `arrange()`

which sorts the data on the column inside the parentheses

`<- wage_data %>% arrange(age) wage_data `

We will start by splitting the data into training and test sets

```
set.seed(222)
<- sample(1:nrow(wage_data), round(nrow(wage_data) * 0.8))
train <- sort(train)
train <- which(!(seq(nrow(wage_data)) %in% train)) test
```

To quickly and easily measure MSEP, write our own function

```
<- function(predictions, true_vals) {
msep_func <- mean((predictions - true_vals)^2)
MSEP return(MSEP)
}
```

## Polynomial Regression

We can start by fitting a polynomial regression using only age

`<- lm(wage ~ poly(age, 4), data = wage_data[train,]) age_poly `

Extract the coefficients from the model

`coef(summary(age_poly))`

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) 111.88351 0.8086511 138.358203 0.000000e+00
poly(age, 4)1 394.18926 39.6156514 9.950342 6.942194e-23
poly(age, 4)2 -434.00566 39.6156514 -10.955409 2.748637e-27
poly(age, 4)3 105.56550 39.6156514 2.664742 7.756432e-03
poly(age, 4)4 -90.09776 39.6156514 -2.274297 2.303622e-02
```

When you use `poly()`

, it returns a matrix of “orthogonal polynomials” so the columns of the matrix are linear combinations of age, age\(^2\), age\(^3\), age\(^4\). Let’s take a look!

`head(poly(wage_data$age, 4))`

```
1 2 3 4
[1,] -0.0386248 0.05590873 -0.07174058 0.08672985
[2,] -0.0386248 0.05590873 -0.07174058 0.08672985
[3,] -0.0386248 0.05590873 -0.07174058 0.08672985
[4,] -0.0386248 0.05590873 -0.07174058 0.08672985
[5,] -0.0386248 0.05590873 -0.07174058 0.08672985
[6,] -0.0386248 0.05590873 -0.07174058 0.08672985
```

`head(wage_data$age)`

`[1] 18 18 18 18 18 18`

If you want it to return the raw powers of age, you can add an argument to the function `poly()`

`head(poly(wage_data$age, 4, raw = TRUE))`

```
1 2 3 4
[1,] 18 324 5832 104976
[2,] 18 324 5832 104976
[3,] 18 324 5832 104976
[4,] 18 324 5832 104976
[5,] 18 324 5832 104976
[6,] 18 324 5832 104976
```

Although the two forms give you different numbers, they result in the same predictions, because your model is still a linear combination of the original powers

```
<- lm(wage ~ poly(age, 4, raw = TRUE),
age_poly_TRUE data = wage_data[train,])
```

Extract the coefficients from the model

`coef(summary(age_poly_TRUE))`

```
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.085312e+02 6.559430e+01 -3.179106 0.0014961570
poly(age, 4, raw = TRUE)1 2.391179e+01 6.439356e+00 3.713383 0.0002091772
poly(age, 4, raw = TRUE)2 -6.646745e-01 2.257678e-01 -2.944063 0.0032705193
poly(age, 4, raw = TRUE)3 8.403840e-03 3.362741e-03 2.499105 0.0125172334
poly(age, 4, raw = TRUE)4 -4.098605e-05 1.802141e-05 -2.274297 0.0230362203
```