library(ISLR)
data(College)
<- College college_data
Section 2.2 - KNN and Linear Regression Code
There are lots of great datasets available as part of R packages. Page 14 of Introduction to Statistical Learning with Applications in R Table 1.1 lays out 15 data sets available from R packages. We will use the College dataset from the ISLR package. The first time you ever use a package, you need to install it. Then, every time you want to use the package, you use library(package_name)
. We will use the college data. Note that details on this data are available online: https://cran.r-project.org/web/packages/ISLR/ISLR.pdf Page 5. You can also get the same information in R by typing: help(“College”) or ?College.
Explore Data
Let’s learn about our data. To get the names of the columns in the dataframe, we can use the function colnames()
colnames(college_data)
[1] "Private" "Apps" "Accept" "Enroll" "Top10perc"
[6] "Top25perc" "F.Undergrad" "P.Undergrad" "Outstate" "Room.Board"
[11] "Books" "Personal" "PhD" "Terminal" "S.F.Ratio"
[16] "perc.alumni" "Expend" "Grad.Rate"
To find out how many rows and columns are in the dataset, use dim()
Recall that this gives us Rows followed by Columns
dim(college_data)
[1] 777 18
You can also look in the “environment” tab, press the blue arrow next to college_data and it will drop down showing the column names with their types and first few values. For college, all columns except the first are numeric. The first column is a factor column, which means it’s categorical. To get a better sense of the data, let’s look at it:
View(college_data)
Suppose we are interested in predicting whether a college is private or public based on available covariates, like Number accepted, enrolled, etc. Additionally, let’s suppose you don’t want certain variables included in your dataset. You can drop these functions using -c(). For example, let’s suppose you don’t want the Apps or Student to Faculty Ratio included in your dataset.
<- college_data[, -c(15, 2)] college_data
Be careful when you are dropping multiple columns. You need to put the numbers in reverse order (from highest to lowest). This is because if you drop the second column first, then the 15th column becomes the the 14th column.
<- College
college_data <- college_data[, -c(2)]
college_data <- college_data[, -c(15)] college_data
A less manual way of dropping columns is to use R to first use R to find the corresponding indices in the data columns. Go back to the original college data
<- College college_data
Find the indices (i.e. column positions) of the columns to drop
<- which(names(college_data) %in% c("Apps", "S.F.Ratio"))
to_drop print(to_drop)
[1] 2 15
Reverse the indices as suggested above
<- rev(to_drop)
to_drop print(to_drop)
[1] 15 2
Now use the object you have defined to drop the columns
<- college_data[, -c(to_drop)] college_data
Also sometimes we have factor variables that we want to convert to numeric variables. To check variable types, you can use the “str” function
str(college_data)
'data.frame': 777 obs. of 16 variables:
$ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
$ Accept : num 1232 1924 1097 349 146 ...
$ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
$ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
$ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
$ F.Undergrad: num 2885 2683 1036 510 249 ...
$ P.Undergrad: num 537 1227 99 63 869 ...
$ Outstate : num 7440 12280 11250 12960 7560 ...
$ Room.Board : num 3300 6450 3750 5450 4120 ...
$ Books : num 450 750 400 450 800 500 500 450 300 660 ...
$ Personal : num 2200 1500 1165 875 1500 ...
$ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
$ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
$ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
$ Expend : num 7041 10527 8735 19016 10922 ...
$ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
You can see that the Private variable is a factor. We can convert it to a numeric variable using the “as.numeric” function. I like my binary variables in R to be 0/1. In R, most factors automatically convert to a binary 1/2 format. I usually prefer a binary 0/1 format. To transform, I subtract 1.
$Private <- as.numeric(college_data$Private) - 1
college_datasummary(college_data$Private)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 1.0000 0.7272 1.0000 1.0000
summary(College$Private)
No Yes
212 565
Let’s get back our original sample
<- College college_data
Testing and Training Sets
In order to make this interesting, let’s split our data into a training set and a test set. To do this, we will use set.seed()
, which will allow us to draw the same pseudorandom numbers the next time we run this code, and we will use the sample()
function.
set.seed(222)
The sample()
function takes two arguments: The first is a vector of numbers from which to draw a random sample. The second is the number of random numbers to draw. The default is to sample without replacement, but you can sample with replacement by adding “, replace = TRUE
” inside the function. Now, let’s generate a list of indices from the original dataset that will be designated part of the test set using sample()
<- sample(1:(nrow(college_data)), round(0.2 * nrow(college_data))) test_ids
To identify the training_ids, we want all of the numbers from 1:nrow(college_data) that aren’t test IDs. Recall that which()
returns the indices for which the statement inside the parentheses is true. which(!())
returns the indices for which the statement inside the parentheses is false. The “!” means “not”. Also, if you wanted to know which values of vector A were in vector B, you can use which(A %in% B)
. So if you want to know which values of vector A are NOT in vector B, you use which(!(A %in B))
, so that’s what we will do – vector A is the vector of all integers between 1 and the number of rows in our data. vector B is the vector of test IDs
<- which(!(1:(nrow(college_data)) %in% test_ids)) training_ids
We can use these indices to define our test and training sets by putting those vectors in the row position inside square brackets.
<- college_data[test_ids,]
test_data <- college_data[training_ids,] training_data
KNN Classification
Let’s develop a KNN model to try to predict whether it’s a private college using all available features.
To use KNN for classification, we need to install and load the library “class”
library(class)
knn() is the function we will use to run the KNN model. It takes four arguments:
- train = training data features (no outcome)
- test = test data features (no outcome)
- cl = training data outcome (class each observation belongs to)
- k = number of nearest neighbors to use
For two-class classification problems, k should be odd (avoids tied votes). Let’s run the model with 1 NN and 9 NNs. To exclude a column, use -# in the column position insider square brackets. (e.g. df[, -2] excludes the second column of dataframe df)
<- knn(train = training_data[, -1],
knn_model1 test = test_data[, -1],
cl = training_data[, 1],
k = 1)
<- knn(train = training_data[, -1],
knn_model9 test = test_data[, -1],
cl = training_data[, 1],
k = 9)
We are trying to predict Private Yes/No. knn()
output predicted values for our test data, so we can compare actual v. predicted values. “prediction == actual” gives a vector with the same number of elements as there are observations in the test set. Each element will either be TRUE (the prediction was correct) or FALSE (the prediction was wrong). Applying which() to this vector will yield the index numbers for all the elements equal to TRUE. Applying length() to that vector tells us how many are TRUE (e.g. for how many observations prediction == actual). We can then divide by the number of observations in the test data to obtain the accuracy rate
<- length(which(knn_model1 == test_data$Private)) / nrow(test_data)
accuracy1 <- length(which(knn_model9 == test_data$Private)) / nrow(test_data)
accuracy9
print(accuracy1)
[1] 0.9096774
print(accuracy9)
[1] 0.9225806
Let’s visualize what is happening in a KNN classification model. We will use the ggplot2
package to create a scatterplot of the training data, and then overlay the test data on top of it. We will color the points by whether the school is private or not.
library(ggplot2)
ggplot(data = training_data,
aes(x = Outstate, y = F.Undergrad,
color = as.factor(Private))) +
geom_point() +
geom_point(data = test_data, aes(x = Outstate, y = F.Undergrad),
color = "black", size = 1) +
scale_color_manual(values = c("red", "blue")) +
theme_minimal() +
guides(color = guide_legend(title = "Private"))
This seems like excellent predictive performance. However, it’s good to think about the distribution of the data. As an extreme example, if all schools in the data were private, we would expect 100% prediction accuracy regardless of our model. Let’s see how well we do if our prediction is all schools are Private. Start by calculating the proportion of private schools
print(length(which(test_data$Private == "Yes")) / nrow(test_data))
[1] 0.716129
We can also check our accuracy on Private schools v. Public schools. To do this, we need to figure out which schools are private in the test data. Specifically, get the indices for the private schools
<- which(test_data$Private == "Yes")
private_schools <- which(test_data$Private == "No")
public_schools
print(private_schools)
[1] 1 2 3 8 11 12 13 14 15 16 18 19 20 21 22 23 24 27
[19] 28 29 30 31 32 33 34 35 37 38 39 41 42 43 44 45 47 48
[37] 49 50 51 52 53 55 56 57 58 60 61 62 63 64 65 66 68 69
[55] 76 77 78 80 81 82 84 85 86 90 91 92 93 94 96 97 99 100
[73] 101 102 104 106 107 108 110 112 113 116 119 120 121 122 123 125 127 128
[91] 129 130 133 134 135 136 137 138 139 140 142 145 146 147 148 149 151 152
[109] 153 154 155
print(public_schools)
[1] 4 5 6 7 9 10 17 25 26 36 40 46 54 59 67 70 71 72 73
[20] 74 75 79 83 87 88 89 95 98 103 105 109 111 114 115 117 118 124 126
[39] 131 132 141 143 144 150
To calculate the prediction accuracy for private schools, we need to know how many (true not predicted) private schools are in the test data. Likewise, we need to know how many public schools are in the test data.
<- length(private_schools)
num_private_schools <- length(public_schools) num_public_schools
Now we will calculate the prediction accuracy separately for private and public schools.
<- length(
private_accuracy1 which(knn_model1[private_schools] == test_data$Private[private_schools])) /
num_private_schools
<- length(
private_accuracy9 which(knn_model9[private_schools] == test_data$Private[private_schools])) /
num_private_schools
Now we will calculate the prediction accuracy separately for private and public schools.
## Private schools (% correctly predicted):
<- length(
private_accuracy1 which(knn_model1[private_schools] == test_data$Private[private_schools])) /
num_private_schools
<- length(
private_accuracy9 which(knn_model9[private_schools] == test_data$Private[private_schools])) /
num_private_schools
# Public schools (% correctly predicted):
<- length(
public_accuracy1 which(knn_model1[public_schools] == test_data$Private[public_schools])) /
num_public_schools
<- length(
public_accuracy9 which(knn_model9[public_schools] == test_data$Private[public_schools])) /
num_public_schools
Let’s see how it did on different school types:
print(private_accuracy1)
[1] 0.9459459
print(public_accuracy1)
[1] 0.8181818
print(private_accuracy9)
[1] 0.972973
print(public_accuracy9)
[1] 0.7954545
Therefore, we did better on private schools than public schools because our prediction accuracy was higher on private schools. Thinking about differential performance by label is related to fairness of machine learning algorithms. For an interesting discussion on ML fairness and different ways to define fairness, see the following academic paper:
Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan.
Inherent Trade-Offs in the Fair
Determination of Risk Scores, November 2016
KNN for Regression
Suppose we wanted to predict how many students would enroll given the other features available in the data. In that case, the classification function we used above will not work. We will need a KNN function designed for regression problems. This function is knn.reg()
in the FNN
package, so we should install then read in the FNN package.
#install.packages("FNN")
library(FNN)
knn.reg() takes four arguments: - training data with only features (no outcome) - test data with only features (no outcome) - training outcomes - k = number of neighbors
Enrollment is the fourth column, so we will exclude that from the features. Because public / private is a factor, we either need to convert it to a numeric variable or exclude it. We will exclude it for now. Note that you can scale your features using scale()
. Deciding to scale your features or not is problem dependent. We will not scale here. If you’re not sure whether or not to scale, you can always try it both ways and see how the performance changes.
<- knn.reg(training_data[, -c(1, 4)],
knn_reg1 -c(1, 4)],
test_data[, $Enroll,
training_datak = 1)
<- knn.reg(training_data[, -c(1, 4)],
knn_reg5 -c(1, 4)],
test_data[,$Enroll,
training_datak = 5)
MSE is an appropriate loss function for regression whereas accuracy is only relevant for classification
<- mean((knn_reg1$pred - test_data$Enroll)^2)
mse_knn1 <- mean((knn_reg5$pred - test_data$Enroll)^2)
mse_knn5
print(mse_knn1)
[1] 124500.9
print(mse_knn5)
[1] 73296.56
Standard Linear Regression
We will now do linear regression. To run a linear regression in R, we use the function lm()
, which stands for linear model. lm()
takes two main arguments. The first is the formula, which should be of the form Dependent Variable ~ Feature1 + Feature2 + … The second is the training data – including both features and the outcome. Note that “~.
” means regress this variable on all other variables
<- lm(Enroll ~ ., training_data) enroll_reg
lm() returns a list, which includes among other things coefficients, residuals, and fitted values for the training data. You can look at the elements in RStudio by using the blue arrow next to enroll_reg in the environment tab. In order to call one element of a list, you can use $
$coefficients enroll_reg
(Intercept) PrivateYes Apps Accept Top10perc
187.938169332 7.806939217 -0.027759639 0.146504112 4.016271367
Top25perc F.Undergrad P.Undergrad Outstate Room.Board
-2.268522370 0.144249348 -0.011850547 -0.003105257 -0.024152785
Books Personal PhD Terminal S.F.Ratio
-0.027184975 0.008447046 -0.431202139 -0.539993156 -0.253400149
perc.alumni Expend Grad.Rate
2.319201329 0.003018132 0.135713594
In order to see a more traditional regression output, use summary()
summary(enroll_reg)
Call:
lm(formula = Enroll ~ ., data = training_data)
Residuals:
Min 1Q Median 3Q Max
-1284.27 -60.18 -8.62 51.46 1544.82
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 187.938169 89.614941 2.097 0.036393 *
PrivateYes 7.806939 30.036405 0.260 0.795017
Apps -0.027760 0.008149 -3.406 0.000702 ***
Accept 0.146504 0.014710 9.959 < 2e-16 ***
Top10perc 4.016271 1.283269 3.130 0.001834 **
Top25perc -2.268522 0.996912 -2.276 0.023222 *
F.Undergrad 0.144249 0.004298 33.560 < 2e-16 ***
P.Undergrad -0.011851 0.006753 -1.755 0.079809 .
Outstate -0.003105 0.004185 -0.742 0.458327
Room.Board -0.024153 0.010712 -2.255 0.024500 *
Books -0.027185 0.049391 -0.550 0.582244
Personal 0.008447 0.013655 0.619 0.536410
PhD -0.431202 1.002234 -0.430 0.667174
Terminal -0.539993 1.094362 -0.493 0.621887
S.F.Ratio -0.253400 2.843330 -0.089 0.929015
perc.alumni 2.319201 0.879334 2.637 0.008568 **
Expend 0.003018 0.002639 1.144 0.253219
Grad.Rate 0.135714 0.647956 0.209 0.834168
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 202.3 on 604 degrees of freedom
Multiple R-squared: 0.9558, Adjusted R-squared: 0.9546
F-statistic: 768.8 on 17 and 604 DF, p-value: < 2.2e-16
If you want to use the coefficients from enroll_reg to predict enrollment values in the test data, you can use the function predict()
. The first argument is the lm object (the whole thing – not just the coefficients) and the second argument is the test data frame without the outcome column
<- predict(enroll_reg, test_data[, -4]) predicted_enroll
Let’s see how well we did in terms of MSE
<- mean((predicted_enroll - test_data$Enroll)^2)
MSE_lm_enroll print(MSE_lm_enroll)
[1] 39312.19
We can see how this compared to our training MSE
print(mean((enroll_reg$residuals)^2))
[1] 39760.39
Training MSE as % of Test MSE:
print(mean((enroll_reg$residuals)^2) / MSE_lm_enroll)
[1] 1.011401
We know that the coefficients might change if we exclude some variables. Let’s pretend we only had Apps and Accept (columns 2 and 3) as features
<- lm(Enroll ~ Apps + Accept, training_data) small_enroll_reg
We can compare coefficients from the small regression and the full regression. If the coefficients in the small regression are different from the coefficients in the full regression, then the small regression suffers from Omitted Variables Bias (OVB).
$coefficients small_enroll_reg
(Intercept) Apps Accept
86.88115150 -0.05243254 0.42420181
$coefficients enroll_reg
(Intercept) PrivateYes Apps Accept Top10perc
187.938169332 7.806939217 -0.027759639 0.146504112 4.016271367
Top25perc F.Undergrad P.Undergrad Outstate Room.Board
-2.268522370 0.144249348 -0.011850547 -0.003105257 -0.024152785
Books Personal PhD Terminal S.F.Ratio
-0.027184975 0.008447046 -0.431202139 -0.539993156 -0.253400149
perc.alumni Expend Grad.Rate
2.319201329 0.003018132 0.135713594
Stargazer for Regression Output
If you want to compare the coefficients from different regressions, you can use the stargazer
package. This package is not installed by default, so you will need to install it.
#install.packages("stargazer")
library(stargazer)
stargazer(small_enroll_reg, enroll_reg,
type = "text", column.labels = c("Small Model", "Full Model"))
========================================================================
Dependent variable:
----------------------------------------------------
Enroll
Small Model Full Model
(1) (2)
------------------------------------------------------------------------
PrivateYes 7.807
(30.036)
Apps -0.052*** -0.028***
(0.013) (0.008)
Accept 0.424*** 0.147***
(0.021) (0.015)
Top10perc 4.016***
(1.283)
Top25perc -2.269**
(0.997)
F.Undergrad 0.144***
(0.004)
P.Undergrad -0.012*
(0.007)
Outstate -0.003
(0.004)
Room.Board -0.024**
(0.011)
Books -0.027
(0.049)
Personal 0.008
(0.014)
PhD -0.431
(1.002)
Terminal -0.540
(1.094)
S.F.Ratio -0.253
(2.843)
perc.alumni 2.319***
(0.879)
Expend 0.003
(0.003)
Grad.Rate 0.136
(0.648)
Constant 86.881*** 187.938**
(20.984) (89.615)
------------------------------------------------------------------------
Observations 622 622
R2 0.820 0.956
Adjusted R2 0.819 0.955
Residual Std. Error 403.989 (df = 619) 202.349 (df = 604)
F Statistic 1,405.761*** (df = 2; 619) 768.822*** (df = 17; 604)
========================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
You can also also use stargazer to get the latex code for a table. This is useful if you want to include the table in a paper or a presentation.
stargazer(small_enroll_reg, enroll_reg,
type = "latex",
column.labels = c("Small Model", "Full Model"))
Dependent variable: | ||
Enroll | ||
Small Model | Full Model | |
(1) | (2) | |
PrivateYes | 7.807 | |
(30.036) | ||
Apps | -0.052^{***} | -0.028^{***} |
(0.013) | (0.008) | |
Accept | 0.424^{***} | 0.147^{***} |
(0.021) | (0.015) | |
Top10perc | 4.016^{***} | |
(1.283) | ||
Top25perc | -2.269^{**} | |
(0.997) | ||
F.Undergrad | 0.144^{***} | |
(0.004) | ||
P.Undergrad | -0.012^{*} | |
(0.007) | ||
Outstate | -0.003 | |
(0.004) | ||
Room.Board | -0.024^{**} | |
(0.011) | ||
Books | -0.027 | |
(0.049) | ||
Personal | 0.008 | |
(0.014) | ||
PhD | -0.431 | |
(1.002) | ||
Terminal | -0.540 | |
(1.094) | ||
S.F.Ratio | -0.253 | |
(2.843) | ||
perc.alumni | 2.319^{***} | |
(0.879) | ||
Expend | 0.003 | |
(0.003) | ||
Grad.Rate | 0.136 | |
(0.648) | ||
Constant | 86.881^{***} | 187.938^{**} |
(20.984) | (89.615) | |
Observations | 622 | 622 |
R^{2} | 0.820 | 0.956 |
Adjusted R^{2} | 0.819 | 0.955 |
Residual Std. Error | 403.989 (df = 619) | 202.349 (df = 604) |
F Statistic | 1,405.761^{***} (df = 2; 619) | 768.822^{***} (df = 17; 604) |
Note: | ^{}p<0.1; ^{}p<0.05; ^{}p<0.01 |