I would like to create a function that works with any data frame, with a minimum number of columns (1) and a maximum number of columns (n). The function has to fit a simple linear regression for each of the independent variables. I know that I have to use a for loop, but I don't know how to use it. I tried this, but it doesn't work:

>data1<-read.csv(file.choose(),header=TRUE,sep=",")
>n<-nrow(data1)
>PredictorVariables <- paste("x", 1:n, sep="")
>Formula <-paste("y ~ ", PredictorVariables, collapse=" + ",data=data1)
>lm(Formula, data=data1)
MBorg

2 Answers


Here is an approach with lapply(), using the mtcars data set. We will select mpg as the dependent variable, extract the names of the remaining columns from the data set, and then use lapply() to run a regression model on each element of the indepVars vector. The output from each model is saved to a list that holds the name of the independent variable as well as the resulting model object.

indepVars <- names(mtcars)[!(names(mtcars) %in% "mpg")]

modelList <- lapply(indepVars,function(x){
     result <- lm(mpg ~ mtcars[[x]],data=mtcars)
     list(variable=x,model=result) 
})

# print the first model
modelList[[1]]$variable
summary(modelList[[1]]$model)

The extract operator [[ can then be used to print the content of any of the models.

...and the output:

> # print the first model
> modelList[[1]]$variable
[1] "cyl"
> summary(modelList[[1]]$model)

Call:
lm(formula = mpg ~ mtcars[[x]], data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9814 -2.1185  0.2217  1.0717  7.5186 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.8846     2.0738   18.27  < 2e-16 ***
mtcars[[x]]  -2.8758     0.3224   -8.92 6.11e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared:  0.7262,    Adjusted R-squared:  0.7171 
F-statistic: 79.56 on 1 and 30 DF,  p-value: 6.113e-10

> 
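In the output above, the slope is labelled `mtcars[[x]]` rather than with the variable name. A minimal sketch of one way to get named coefficients is to build each formula from the column name with `reformulate()`. The name `namedModels` is introduced here for illustration only, and the sketch assumes the column names are syntactically valid R names.

```r
# Sketch: build each formula from the column name so the coefficient
# table shows the real variable name instead of `mtcars[[x]]`.
indepVars <- setdiff(names(mtcars), "mpg")

namedModels <- lapply(indepVars, function(x) {
  # reformulate("cyl", response = "mpg") yields the formula mpg ~ cyl
  lm(reformulate(x, response = "mpg"), data = mtcars)
})
names(namedModels) <- indepVars

# The slope is now labelled "cyl"
coef(namedModels[["cyl"]])
```

Naming the list elements also means a model can be looked up by variable name instead of by position.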

Responding to the comment from the original poster, here is the code necessary to encapsulate the above process within an R function. The function regList() takes a data frame and the name of the dependent variable as a string, and then runs a regression of the dependent variable on each of the remaining variables in the data frame passed to the function.

regList <- function(dataframe,depVar) {
     indepVars <- names(dataframe)[!(names(dataframe) %in% depVar)]
     
     modelList <- lapply(indepVars,function(x){
          message("x is: ",x)
          result <- lm(dataframe[[depVar]] ~ dataframe[[x]],data=dataframe)
          list(variable=x,model=result) 
     })
     modelList
}

modelList <- regList(mtcars,"mpg")
# print the first model
modelList[[1]]$variable
summary(modelList[[1]]$model)

One can extract a variety of content from the individual model objects. The output is as follows:

> modelList <- regList(mtcars,"mpg")
> # print the first model
> modelList[[1]]$variable
[1] "cyl"
> summary(modelList[[1]]$model)

Call:
lm(formula = dataframe[[depVar]] ~ dataframe[[x]], data = dataframe)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9814 -2.1185  0.2217  1.0717  7.5186 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     37.8846     2.0738   18.27  < 2e-16 ***
dataframe[[x]]  -2.8758     0.3224   -8.92 6.11e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared:  0.7262,    Adjusted R-squared:  0.7171 
F-statistic: 79.56 on 1 and 30 DF,  p-value: 6.113e-10

>
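
As an illustration of extracting content from the model objects, the following sketch collects the slope, its p-value, and R-squared from every model into one data frame. It rebuilds the model list so that it runs on its own; `modelSummary` is a name introduced here, and the sketch assumes every predictor is numeric so that row 2 of the coefficient table is the slope.

```r
# Rebuild the model list from the answer so this block is self-contained
indepVars <- setdiff(names(mtcars), "mpg")
modelList <- lapply(indepVars, function(x) {
  list(variable = x, model = lm(mtcars[["mpg"]] ~ mtcars[[x]]))
})

# One row per model: variable name, slope, slope p-value, R-squared
modelSummary <- do.call(rbind, lapply(modelList, function(m) {
  s <- summary(m$model)
  data.frame(
    variable  = m$variable,
    slope     = coef(s)[2, "Estimate"],
    p.value   = coef(s)[2, "Pr(>|t|)"],
    r.squared = s$r.squared
  )
}))
modelSummary
```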
Len Greski
  • but I need a function that receives the following arguments: A dataframe, the column number of the response variable, the number of the minimum explanatory variable column, the number of the maximum explanatory variable column. for example: function1 (df, 1, 2,10) 1 is the column of the response variable, the explanatory variables are located on columns from 2 to 10 inclusively – jean-philippe Jan 29 '18 at 02:36
  • The function fits a simple linear regression of the response variable on each explanatory variable of the dataframe individually (y ~ x1, y ~ x2, ... etc). It returns the diagnostic charts for each of these regressions (2 rows, 2 columns). The function must work regardless of the data frame submitted, so that it generalizes to all dataframes – jean-philippe Jan 29 '18 at 02:36
  • @jean-philippe - the additional context in your comments above should have been included in your question, along with a [Minimal, Complete, and Verifiable Example](https://stackoverflow.com/help/mcve). That said, I updated my answer to include an R function that allows one to specify a data frame name and a dependent variable name, rather than column numbers. The answer can easily be tweaked to use column numbers. – Len Greski Jan 29 '18 at 03:53
  • Sorry, I am just a beginner in R and it's too difficult for me. Your help is really appreciated. – jean-philippe Jan 29 '18 at 04:07
  • I need something more general that uses column numbers, so it works with any name of the independent or dependent variable in any data frame. – jean-philippe Jan 29 '18 at 04:10
  • @jean-philippe - the function I posted, `regList()`, is already generalized. One passes a data frame name and dependent variable name as arguments to the function, and the function generates linear models for all other variables in the data frame. I simply used `mtcars` as an example, since you didn't post a verifiable example in your question. – Len Greski Feb 03 '18 at 13:46

How about the following:

First, I create some sample data:

# Sample data
set.seed(2017);
x <- sapply(1:10, function(x) x * seq(1:100) + rnorm(100));
df <- data.frame(Y = rowSums(x), x);

Next I define a custom function:

# Custom function where
#  df is the source dataframe
#  idx.y is the column index of the response variable in df
#  idx.x.min is the column index of the first explanatory variable
#  idx.x.max is the column index of the last explanatory variable
# The function returns a list of lm objects
myfit <- function(df, idx.y, idx.x.min, idx.x.max) {
    # Allow a single explanatory column (idx.x.min == idx.x.max)
    stopifnot(idx.x.min <= idx.x.max, idx.x.max <= ncol(df));
    res <- list();
    for (i in idx.x.min:idx.x.max) {
        res[[length(res) + 1]] <- lm(df[, idx.y] ~ df[, i]);
    }
    return(res);
}

Then I run myfit using the sample data.

lst <- myfit(df, 1, 2, 11);

The return object lst is a list of 11-2+1 = 10 fit results of class lm. For example,

lst[[1]];
#
#Call:
#lm(formula = df[, idx.y] ~ df[, i])
#
#Coefficients:
#(Intercept)      df[, i]
#     -5.121       55.100
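
The original poster also asked for the diagnostic charts of each regression in a 2 x 2 layout. A possible sketch follows; `plot_diagnostics()` is a hypothetical helper, not part of the answer above, and it assumes syntactically valid column names. It relies on the default `plot()` method for `lm` objects, which draws four diagnostic plots per model, and it writes them to a PDF so that each model gets its own page. The usage lines run on `mtcars` so the block works on its own.

```r
# Hypothetical helper: fit y ~ x for each column index in
# idx.x.min:idx.x.max and draw the four default lm diagnostic plots.
plot_diagnostics <- function(df, idx.y, idx.x.min, idx.x.max) {
  stopifnot(idx.x.min <= idx.x.max, idx.x.max <= ncol(df))
  idx.x <- setdiff(seq(idx.x.min, idx.x.max), idx.y)  # never regress y on itself
  old.par <- par(mfrow = c(2, 2))                     # 2 rows x 2 columns per model
  on.exit(par(old.par))
  fits <- lapply(idx.x, function(i) {
    fit <- lm(reformulate(names(df)[i], response = names(df)[idx.y]), data = df)
    plot(fit, main = names(df)[i])                    # one page of four plots
    fit
  })
  names(fits) <- names(df)[idx.x]
  invisible(fits)
}

# Usage: response in column 1, explanatory variables in columns 2 to 4
out.file <- file.path(tempdir(), "diagnostics.pdf")
pdf(out.file)
fits <- plot_diagnostics(mtcars, 1, 2, 4)
dev.off()
names(fits)
```

In an interactive session without the `pdf()` call, each model's page replaces the previous one on screen, which is why the sketch sends the plots to a file.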

PS

For future posts I recommend having a look at how to ask good questions here on SO, and providing a minimal reproducible example/attempt, including sample data.

Maurits Evers