I'm writing a script that has to build a large matrix. I want to take a vector of names and, for each name, get data from a different data frame, do some operations on it, and then return a vector of data for that name. For example:

allNew=matrix(ncol=ncol(X)-1);
for(name in list)
    {
    tmpdata=all[grep(names,list$Names),];
    data=(as.data.frame(apply(tmpdata[,2:(ncol(tmpdata)-1)],2,sum))==nrow(tmpdata))*1
    colnames(data)=name;
    data=t(data);
    allNew=rbind(allNew,data);
    }

The names list has about 10,000 entries, and for each name tmpdata has 1-5 rows. I'm running my code on my lab's Linux server with about 8 GB of RAM. It takes a few minutes, which feels a lot longer than it should. How can I do this more efficiently?

Anthon
SivanG
  • Here is a similar question: http://stackoverflow.com/questions/5980240/performance-of-rbind-data-frame – Jonas Tundo Apr 07 '13 at 07:11
  • Don't grow a matrix inside a loop. Make it the final size at the start and then, if you have to use a loop, just assign into its rows as you go. – Glen_b Apr 07 '13 at 07:20
  • Also, your `apply` can be replaced by the much faster `colSums`, and if you go for the preallocated matrix, `as.data.frame`, `colnames<-`, and the transpose are not necessary. – cbeleites unhappy with SX Apr 07 '13 at 09:57

1 Answer

As the comments pointed out, growing an object one row at a time is much slower than overwriting parts of a pre-allocated object. Something like this should work, though without any test data it's hard to be sure.

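allNew = matrix(NA, ncol = ncol(X) - 1, nrow = length(list))  # pre-allocate the full result
for (i in seq_along(list))
    {
    name <- list[[i]]  # i-th name, matching `for(name in list)` in the question
    tmpdata = all[grep(name, list$Names), ]  # the pattern is `name`, not the base function `names`
    data = (as.data.frame(apply(tmpdata[, 2:(ncol(tmpdata)-1)], 2, sum)) == nrow(tmpdata)) * 1
    allNew[i, ] = t(data)  # overwrite row i instead of growing allNew with rbind
    }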
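
Combining the pre-allocation above with the `colSums` suggestion from the comments, a self-contained sketch might look like the following. The toy objects `name_vec`, `all_df`, and `value_cols` are made up for illustration and stand in for the question's `list`, `all`, and column range. The sketch also assumes names are matched exactly with `==` rather than with `grep`, which may not be what the original data needs.

```r
# Hypothetical toy data standing in for the question's objects.
name_vec <- c("a", "b", "c")
all_df <- data.frame(
  Names = c("a", "a", "b", "c", "c", "c"),
  f1    = c(1, 1, 1, 0, 1, 1),
  f2    = c(1, 0, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)
value_cols <- c("f1", "f2")  # assumed: the numeric columns to summarise

# Pre-allocate the result at its final size: one row per name.
result <- matrix(NA_real_,
                 nrow = length(name_vec),
                 ncol = length(value_cols),
                 dimnames = list(name_vec, value_cols))

for (i in seq_along(name_vec)) {
  # Rows belonging to this name (exact match, an assumption of this sketch).
  rows <- all_df[all_df$Names == name_vec[i], value_cols, drop = FALSE]
  # 1 where every row has a 1 in that column, 0 otherwise.
  result[i, ] <- (colSums(rows) == nrow(rows)) * 1
}

print(result)
```

Because each row is written directly into `result`, the `as.data.frame`, `colnames<-`, and `t` steps are no longer needed, and the names survive as row names.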
Gregor Thomas
  • Thank you for the quick reply! About an hour after writing the question I realized myself that it was faster to pre-allocate the matrix and overwrite it line by line. – SivanG Apr 07 '13 at 08:39
  • @user2253904 Search for "The R Inferno" by Patrick Burns. Eye opener. – Roman Luštrik Apr 07 '13 at 09:36
  • @user2253904, please vote and accept that answer if you are happy with it. – flodel Apr 07 '13 at 12:01