8

This is something which data analysts do all the time (especially when working with survey data that features missing responses). It's common to first multiply impute a set of complete data matrices, fit models to each of these matrices, and then combine the results. At the moment I'm doing things by hand and am looking for a more elegant solution.

Imagine there are 5 *.csv files in the working directory, named dat1.csv, dat2.csv, ..., dat5.csv. I want to estimate the same linear model using each data set.

Given this answer, a first step is to gather a list of the files, which I do with the following:

csvdat <- list.files(pattern="dat.*csv")

Now I want to do something like

for(x in csvdat) {
    lm.which(csvdat == "x") <- lm(y ~ x1 + x2, data = x)
}

The "which" statement is my silly way of trying to number each model in turn, using the position in the csvdat list that the loop is currently up to. That is, I'd like this loop to return a set of 5 lm objects with the names lm.1, lm.2, etc.

Is there some simple way to create these objects, and name them so that I can easily indicate which data set they correspond to?

Thanks for your help!

tomw

3 Answers

10

Use a list to store the results of your regression models as well, e.g.

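# Simulate a data frame with two predictors (V1, V2) and a response y
foo <- function(n) {
  transform(as.data.frame(replicate(2, rnorm(n))), y = V1 + V2 + rnorm(n))
}
# row.names = FALSE keeps the row index out of the file, so that
# y ~ . does not pick it up as an extra predictor when read back in
write.csv(foo(10), file = "dat1.csv", row.names = FALSE)
write.csv(foo(10), file = "dat2.csv", row.names = FALSE)
csvdat <- list.files(pattern = "dat.*csv")
lm.res <- list()
for (i in seq(along = csvdat))
  lm.res[[i]] <- lm(y ~ ., data = read.csv(csvdat[i]))
# Name each fitted model after the file it was estimated from
names(lm.res) <- csvdat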
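Once the fits live in a named list, getting at them is straightforward. The following is only a sketch of how that might look: the temporary directory, the simulated data, and the name `coef.mat` are assumptions made for illustration and are not part of the original question.

```r
# Hypothetical stand-in for the CSV files: two simulated data sets in a temp dir
set.seed(1)
tmp <- file.path(tempdir(), "lm_list_demo")
dir.create(tmp, showWarnings = FALSE)
for (i in 1:2) {
  d <- data.frame(V1 = rnorm(20), V2 = rnorm(20))
  d$y <- d$V1 + d$V2 + rnorm(20)
  write.csv(d, file.path(tmp, paste0("dat", i, ".csv")), row.names = FALSE)
}

# Fit the same model to every file and store the fits in a named list
csvdat <- list.files(tmp, pattern = "^dat[0-9]+\\.csv$")
lm.res <- list()
for (i in seq_along(csvdat))
  lm.res[[i]] <- lm(y ~ ., data = read.csv(file.path(tmp, csvdat[i])))
names(lm.res) <- csvdat

# Pull out a single fit by the name of its file
fit1 <- lm.res[["dat1.csv"]]
summary(fit1)

# Collect the coefficients of every fit into one matrix (one column per file)
coef.mat <- sapply(lm.res, coef)
```

Having all coefficients side by side like this is usually the starting point for combining results across imputed data sets.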
chl
  • +1 that is definitely a good and Rish way. But, the OP wanted single variables, so `assign` is the way to go... – Henrik May 26 '11 at 21:33
  • @Henrik Agree. (btw, you could replace your `paste` command with `sep="."`, but I won't be able to upvote more :-) – chl May 26 '11 at 21:42
  • @chl good point with `sep="."`. Furthermore, loading the data in the call to `lm` is also pretty neat I think (then one could put everything in one line!). – Henrik May 26 '11 at 21:45
  • @chl thanks for the insight on loading data within the lm call. – tomw May 26 '11 at 23:21
  • Definitely the way to go. Just because a client asks you for an inefficient implementation that will require more work for him doesn't mean you shouldn't offer a better one. – IRTFM May 27 '11 at 05:08
  • 1
    just had the idea of enhancing this with preallocating : `lm.res <- vector("list", length(csvdat))` would potentially save some time on bigger operations. – Henrik May 27 '11 at 06:58
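The preallocation idea from the last comment could look roughly like the sketch below. The number of data sets (`n.models`) and the simulated in-memory data are assumptions standing in for the CSV files.

```r
set.seed(4)
n.models <- 5  # hypothetical number of data sets

# Allocate the full list up front instead of growing it inside the loop
lm.res <- vector("list", n.models)

for (i in seq_len(n.models)) {
  # Simulated data standing in for read.csv(csvdat[i])
  d <- data.frame(V1 = rnorm(20), V2 = rnorm(20))
  d$y <- d$V1 + d$V2 + rnorm(20)
  lm.res[[i]] <- lm(y ~ ., data = d)
}
names(lm.res) <- paste0("dat", seq_len(n.models), ".csv")
```

For a handful of models the difference is unlikely to be noticeable; it matters more when the list grows to many elements.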
10

Another approach is to use the plyr package to do the looping. Using the example constructed by @chl, here is how you would do it:

require(plyr)

# read csv files into list of data frames
data_frames = llply(csvdat, read.csv)

# run regression models on each data frame
regressions = llply(data_frames, lm, formula = y ~ .)
names(regressions) = csvdat
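If plyr is not available, much the same two steps can probably be written with base R's `lapply`. This is a swapped-in alternative rather than the plyr approach itself, and the temporary directory and simulated data are assumptions used only to make the sketch self-contained.

```r
# Hypothetical stand-in for the CSV files: two simulated data sets in a temp dir
set.seed(2)
tmp <- file.path(tempdir(), "lm_lapply_demo")
dir.create(tmp, showWarnings = FALSE)
for (i in 1:2) {
  d <- data.frame(V1 = rnorm(20), V2 = rnorm(20))
  d$y <- d$V1 + d$V2 + rnorm(20)
  write.csv(d, file.path(tmp, paste0("dat", i, ".csv")), row.names = FALSE)
}
csvdat <- list.files(tmp, pattern = "^dat[0-9]+\\.csv$", full.names = TRUE)

# read csv files into a list of data frames
data_frames <- lapply(csvdat, read.csv)

# run the regression model on each data frame
regressions <- lapply(data_frames, function(d) lm(y ~ ., data = d))
names(regressions) <- basename(csvdat)
```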
Ramnath
  • More elegant solution than mine. My +1 to come, but see @Henrik response which is what the OP was looking for. – chl May 26 '11 at 21:44
  • @chl Both your solution and Ramnath's will serve the questioner better in the long run. – IRTFM May 27 '11 at 05:11
8

What you want is a combination of the functions seq_along() and assign().

seq_along creates a vector from 1 to 5 if there are five objects in csvdat (to get the appropriate numbers and not only the variable names). Then assign (using paste to create the appropriate strings from the numbers) lets you create the variable.

Note that you will also need to load the data file first (was missing in your example):

for (x in seq_along(csvdat)) {
    data.in <- read.csv(csvdat[x])   #be sure to change this to read.table if necessary
    assign(paste("lm.", x, sep = ""), lm(y ~ x1 + x2, data = data.in))
}

seq_along is not strictly necessary; there are other ways to solve the numbering problem.

The critical function is assign. With assign you can create variables with a name based on a string. See ?assign for further info.
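As a small sketch of what this gives you afterwards: variables created with assign can be used directly, or looked up again by name with `get()` and `mget()`. The in-memory data sets (`dats`) and the names `second.fit` and `all.fits` are assumptions made for illustration.

```r
set.seed(3)
# Hypothetical in-memory data sets standing in for the CSV files
dats <- lapply(1:3, function(i) {
  d <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
  d$y <- d$x1 + d$x2 + rnorm(20)
  d
})

# Create lm.1, lm.2, lm.3 from strings built with paste
for (x in seq_along(dats)) {
  assign(paste("lm", x, sep = "."), lm(y ~ x1 + x2, data = dats[[x]]))
}

# The variables now exist under the generated names
coef(lm.1)

# get() looks one up by its name as a string;
# mget() collects several of them into a named list
second.fit <- get("lm.2")
all.fits <- mget(paste("lm", seq_along(dats), sep = "."))
```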


Following chl's comments (see his post) everything in one line:

for (x in seq_along(csvdat)) assign(paste("lm", x, sep = "."), lm(y ~ x1 + x2, data = read.csv(csvdat[x])))
Henrik
  • Good that you point to `assign`. (I'm out of votes today, but be sure I'll upvote ASAP.) – chl May 26 '11 at 21:30