
Possible Duplicate:
Why are loops slow in R?

Consider the following task. A dataset has 40 variables for 20,000 "users". Each user has between 1 and 150 observations. All users are stacked in a matrix called data, whose first column is the user id. All ids are stored in a 20,000 x 1 matrix called userid.

Consider the following R code

useridl = length(userid)
itime = proc.time()[3]
for (i in 1:useridl) {
  temp = data[data[, 1] == userid[i], ]
}
etime = proc.time()[3]
etime - itime

This code just loops over the 20,000 users, each time creating the temp matrix with the subset of observations belonging to userid[i]. It takes about 6 minutes on a Mac Pro.

In MATLAB, the same task

tic
for i=1:useridl
temp=data(data(:,1)==userid(i),:);
end
toc

takes 1 minute.

Why is R so much slower? This is a standard task, and I am using matrices in both cases. Any ideas?

Hernan
  • There are almost certainly (much) better ways to do that in R. Create a [small toy example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) (e.g. with 5 variables, 2 users, and 3 observations each), and someone here will show you one of them. – Josh O'Brien Oct 30 '12 at 15:36
  • To add to Josh's point, try not to use `data` as this is a defined word in R. Also, I believe wrapping your function in `system.time()` is more accurate than adding and subtracting two `proc.time()` calls. – Brandon Bertelsen Oct 30 '12 at 15:44
  • When people compare speeds between languages with "identical" code, what they often end up comparing is their programming skill in each language. (i.e. if I started comparing R to Matlab, I'd likely be shocked at how slow Matlab is.) – joran Oct 30 '12 at 15:47
  • @user1786009 you might ruffle fewer feathers in the R crowd if you phrased your question differently. The way you've put it suggests R falls short on a 'standard task', implying a presumption on your part that the way you've done it is the right way to do it in R. Rather than 'Why is this task slower in R than MATLAB?', a more neutral way to frame this question might be, 'Why is this construct slower in R than Matlab?'. The latter might yield insight to the comparative workings of the two languages, while the former is likely to yield attempts to show that R is not inferior. – Matthew Plourde Oct 30 '12 at 16:14
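The `system.time()` suggestion above could look roughly like the sketch below. The data here is a small made-up stand-in (200 users, 5 variables), not the asker's dataset, and the matrix is named `Data` to avoid masking the built-in `data()` function. `drop = FALSE` is an addition that keeps `temp` a matrix even when a user has a single observation.

```r
# Small illustrative data (assumed sizes): 200 users, 1-20 observations each, 5 variables
set.seed(1)
userid <- 1:200
obs <- sample(20, length(userid), replace = TRUE)
Data <- cbind(rep(userid, obs), matrix(rnorm(5 * sum(obs)), sum(obs), 5))

# Time the whole loop in one call instead of subtracting two proc.time() readings
timing <- system.time(
  for (i in seq_along(userid)) {
    # drop = FALSE keeps a one-row subset as a matrix rather than a vector
    temp <- Data[Data[, 1] == userid[i], , drop = FALSE]
  }
)

# Elapsed wall-clock time in seconds
timing["elapsed"]
```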

1 Answer


As @joran commented, that's bad R practice. Instead of repeatedly subsetting your original matrix, just put the subsets in a list once and then iterate over the list with lapply or similar.

# make example data
set.seed(21)
userid <- 1:1e4
obs <- sample(150, length(userid), TRUE)
users <- rep(userid, obs)
Data <- cbind(users,matrix(rnorm(40*sum(obs)),sum(obs),40))

# reorder so Data isn't sorted by userid
Data <- Data[order(Data[,2]),]
# note that you have to call the data.frame method explicitly,
# the default method returns a vector
system.time(temp <- split.data.frame(Data, Data[,1])) ## Returns times in seconds
#    user  system elapsed 
#    2.84    0.08    2.92 

My guess is that the garbage collector is slowing down your R code, since you're continually overwriting the temp object.
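The "iterate over the list with lapply" step is not shown above, so here is a minimal sketch of what it might look like. The data is a smaller made-up example, and the per-user work (column means) and the names `by_user` and `user_means` are placeholders for illustration, not part of the original task.

```r
# Small illustrative data (assumed sizes): 100 users, 1-10 observations each, 4 variables
set.seed(21)
userid <- 1:100
obs <- sample(10, length(userid), replace = TRUE)
Data <- cbind(users = rep(userid, obs), matrix(rnorm(4 * sum(obs)), sum(obs), 4))

# Split once: a list with one matrix per user, named by user id
by_user <- split.data.frame(Data, Data[, 1])

# Iterate over the pre-split list; the per-user work here is just column means
user_means <- lapply(by_user, function(m) colMeans(m[, -1, drop = FALSE]))

# Look up one user's result by id (list names are character strings)
user_means[["7"]]
```

The matrix is scanned once in the split rather than once per user, which is where the original loop spends most of its effort.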

Joshua Ulrich
  • Good to know about `split.data.frame` as applied to matrices. Very nice. – Josh O'Brien Oct 30 '12 at 16:20
  • @JoshO'Brien: yeah, it caught me off-guard, so I actually had to read the documentation. – Joshua Ulrich Oct 30 '12 at 16:21
  • And I actually needed your hint to read the documentation... I guess they have `split.data.frame` as the workaround since the `split()`'s `drop=` argument is already being used to direct dropping (or not) of unused factor levels. – Josh O'Brien Oct 30 '12 at 16:26
  • Thanks Joshua and Josh! I did not know about the split.data.frame function and it worked wonders. The whole thing took 10 seconds. Incredible. My learning from this is that I should learn to use the data frame functions. I spent my young years with MatLab so I still carry old habits. My co-researcher and I are in awe. ;) – Hernan Oct 30 '12 at 17:07
  • @user1786009: No, no, no, this does not mean you should use `data.frame` methods!!! data.frames are much slower than matrices, the only reason I used the data.frame method of `split` is because the default method only returns a list of vectors, not a list of matrices. Please read `?split` to understand. – Joshua Ulrich Oct 30 '12 at 17:10
  • `users <- rep(userid, obs)` is a simpler and quicker way than `do.call(c,mapply(rep,userid,obs))` – mnel Oct 30 '12 at 22:10
  • @mnel: very true, edited. I think you should be able to edit answers; feel free to edit mine. – Joshua Ulrich Oct 30 '12 at 23:43
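To make the point about the two `split` methods concrete, here is a tiny sketch on a made-up 3 x 3 matrix (the names `m`, `v`, and `r` are only for illustration). It shows why the data.frame method has to be called explicitly when the input is a matrix.

```r
# Toy matrix: two rows for user 1, one row for user 2
m <- cbind(id = c(1, 1, 2), x = c(10, 20, 30), y = c(0.1, 0.2, 0.3))

# Default method: treats the matrix as a plain vector and splits its elements
v <- split(m, m[, "id"])

# Explicit data.frame method: splits by row and keeps each piece a matrix
r <- split.data.frame(m, m[, "id"])

is.matrix(v[["1"]])  # FALSE
is.matrix(r[["1"]])  # TRUE
dim(r[["1"]])        # 2 3
```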