In the video lab section of Introduction to Statistical Learning, Chapter 3, there are functions to perform regression on one predictor variable. The presenter is Trevor Hastie. The relevant section starts at 19:22.

https://www.youtube.com/watch?v=gNZfqHhq_B4&list=PLoROMvodv4rOzrYsAxzQyHb8n_RWNuS1e&index=14

I would like to extend and modify the code so that the function works when there is a data.frame and the number of predictors is not known in advance. I do not want to use attach().

Below is a function from the presentation.

Then I create two data frames.

I also created three different interactive scenarios that I would like the function to be able to handle. I am looking for help on writing the function.

# Function By Trevor Hastie
library(ISLR2)
# version 1
regplot = function(x,y) {
  fit = lm(y~x)
  plot(x,y)
  abline(fit, col = "red")
}
attach(Carseats)
regplot(Price, Sales)

# My work

numRows = 100

df1 = data.frame(y = rnorm(numRows), x1 = rnorm(numRows))
df2 = data.frame(y = rnorm(numRows), x1 = rnorm(numRows), 
                 x2 = rnorm(numRows), x3 = rnorm(numRows) )

# Case 1
lm.fit1 = lm(y ~ x1, data = df1)
plot(lm.fit1$fitted.values, lm.fit1$residuals)
# Possible function call: regplot(df1, y, x1)

# Case 2
lm.fit2 = lm(y ~ x1 + x2, data = df2)
plot(lm.fit2$fitted.values, lm.fit2$residuals)
# Possible function call: regplot(df2, y, c(x1, x2))

# Case 3
lm.fit3 = lm(y ~ x1 + x2 + x3 + x2:x3, data = df2)
plot(lm.fit3$fitted.values, lm.fit3$residuals)
# Possible function call: regplot(df2, y, c(x1, x2, x3), inter = c(x2, x3))

user2738483
  • `reformulate()` is very useful for these applications. You've made your life harder by specifying the predictor variables symbolically/as data rather than as strings, though. – Ben Bolker Feb 20 '23 at 18:14
  • This is OK: `regplot(df1, 'y', 'x1')`. I am just looking for one function that can handle input consisting of (1) a data.frame, (2) a response variable, and (3) a number of predictor variables that is not specified ahead of time. I think I need more details than what I am seeing online for reformulate. – user2738483 Feb 20 '23 at 19:00

2 Answers

I think you could do this with reformulate(), as @BenBolker suggests, but you could also write a function that expects a formula as its first argument (like lm() does). Here's that way first. You provide the formula and the data, both are passed to lm(), and the fitted values and residuals from the result are plotted.

numRows = 100
set.seed(519)
df2 = data.frame(y = rnorm(numRows), x1 = rnorm(numRows), 
                 x2 = rnorm(numRows), x3 = rnorm(numRows) )

regplot <- function(form, data, ...){
  fit <- lm(form, data)
  yhat <- fit$fitted.values
  e <- fit$residuals
  plot(yhat, e, ...)
}

regplot(y ~ x1, df2)

regplot(y ~ x1 + x2, df2)

regplot(y ~ x1 + x2 + x3 + x2:x3, df2)

Here's the same function, but written with reformulate(). It expects a character string giving the response, a character vector of any length giving the names of the predictor variables, and a data frame. reformulate() combines the names into a formula, which is then passed to lm(). It's one extra step, but depending on your workflow, it may be more convenient to do it inside the function.

regplot2 <- function(response, predictors, data, ...){
  form <- reformulate(predictors, response=response)
  fit <- lm(form, data)
  yhat <- fit$fitted.values
  e <- fit$residuals
  plot(yhat, e, ...)
}

regplot2("y", "x1", df2)

regplot2("y", c("x1", "x2"), df2)

regplot2("y", c("x1", "x2", "x3", "x2:x3"), df2)

Created on 2023-02-20 by the reprex package (v2.0.1)
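
If it helps to see what reformulate() actually hands to lm(), here is a minimal sketch that builds the formula on its own and inspects it. The variable names `predictors` and `form` are just for illustration; an interaction is passed as a string such as "x2:x3", as in the last call above.

```r
# Build the formula from strings, without fitting anything
predictors <- c("x1", "x2", "x3", "x2:x3")
form <- reformulate(predictors, response = "y")

# Print the formula that lm() would receive
print(form)

# The individual terms, including the interaction
term_labels <- attr(terms(form), "term.labels")
print(term_labels)
```

Printing the formula before fitting is a cheap way to check that the predictor vector was assembled the way you intended.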

DaveArmstrong

After studying the accepted solutions, I did some research and found a useful post: "Loop function to add large numbers of predictors in regression function".

This enabled me to slightly modify the second solution using more common functions.

regplot3 <- function(response, predictors, data, ...){
  #form <- reformulate(predictors, response=response)
  form <- as.formula(paste(response, '~', paste(predictors, collapse = '+')))
  fit <- lm(form, data)
  yhat <- fit$fitted.values
  e <- fit$residuals
  plot(yhat, e, ...)
}
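
As a quick sanity check, the sketch below compares the formula built with paste() and as.formula() against the one from reformulate() for the same inputs. It assumes plain, syntactic column names like the ones used here; `f_paste`, `f_reform` and `same` are names made up for the illustration.

```r
response <- "y"
predictors <- c("x1", "x2", "x3", "x2:x3")

# Formula built by pasting strings together, as in regplot3
f_paste <- as.formula(paste(response, "~", paste(predictors, collapse = "+")))

# Formula built by reformulate(), as in regplot2
f_reform <- reformulate(predictors, response = response)

# Compare the text of the two formulas
same <- identical(deparse(f_paste), deparse(f_reform))
print(same)
```

For inputs like these the two approaches should give the same formula, so the choice between them is mostly a matter of taste.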
user2738483