
I have a script whose total elapsed time is around 30 secs, of which the following code accounts for around 27 secs. I narrowed the offending code down to this line:

d$dis300[i] <- h

So I changed it to the second version below, and it now runs really fast (as expected).

My question is why the first version is so slow compared to the second. The datos DF is around 7500x18 vars.

First: (27 sec elapsed)

d$dis300 <- 0
for (i in 1:netot) {
  h <- aaa[d$ent[i], d$dis[i]]
  if (h == 0) writeLines(sprintf("ERROR. ent:%i dis:%i", d$ent[i], d$dis[i]))
  d$dis300[i] <- h
}

Second: (0.2 sec elapsed)

d$dis300 <- 0
for (i in 1:netot) {
  h <- aaa[d$ent[i], d$dis[i]]
  if (h == 0) writeLines(sprintf("ERROR. ent:%i dis:%i", d$ent[i], d$dis[i]))
  foo[i] <- h
}
d$foo <- foo

You can see both are the "same", but the offending one assigns into the DF instead of into a single vector.

Any comment is really appreciated. I come from other kinds of languages and this drove me nuts for a while. At least I have a solution, but I'd like to prevent this kind of issue in the future.

Thanks for your time,

Arun
notuo
  • Just to clarify, the difference in speed between the two is 30sec vs 27sec, and you consider this a dramatic speed-up? – joran Apr 24 '12 at 23:49
  • If @joran's relative timings are correct (and when is he wrong? :-) ), you'll get a lot better speed-ups by adopting these habits and techniques: http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r/8474941#8474941 – Ari B. Friedman Apr 25 '12 at 00:10
  • @gsk3 Daily, according to my wife. – joran Apr 25 '12 at 00:25
  • Seriously, I was going to suggest that the modest speed-up seen here was due to R's copy-on-assignment behavior: in one case copying a whole data frame, in the other only a vector, hence a small speed increase. And before you recoil in horror at the thought of copy-on-assignment, keep in mind that "coding smart" by following gsk3's advice means it usually isn't an issue. – joran Apr 25 '12 at 00:27
  • Sorry. Here is the correct timing: the whole script takes 30 sec and the first part takes 27 sec. If we compare the first against the second, the times are First: 27 sec, Second: 0.2 sec. This is just an excerpt of the whole script. – notuo Apr 25 '12 at 00:27
  • @notuo But the first sentence of your question still comes across that the difference is 30sec vs 27sec. Can you edit that please? – Matt Dowle Apr 25 '12 at 09:31

2 Answers

The reason is that d$dis300[i] <- h calls $<-.data.frame.

It's a rather complex function as you can see:

`$<-.data.frame`

You don't say what foo is, but if it is an atomic vector, foo[i] <- h uses the [<- function, which is implemented in C for speed.

Still, I hope you declare foo as follows:

foo <- numeric(netot)

This will ensure you don't need to reallocate the vector for each assignment in the loop:

foo <- 0 # BAD!
system.time( for(i in 1:5e4) foo[i] <- 0 ) # 4.40 secs
foo <- numeric(5e4) # Pre-allocate
system.time( for(i in 1:5e4) foo[i] <- 0 ) # 0.09 secs

If you use the *apply family instead, you don't need to worry about that:

d$foo <- vapply(1:netot, function(i, aaa, ent, dis) {
  h <- aaa[ent[i], dis[i]]
  if (h == 0) writeLines(sprintf("ERROR. ent:%i dis:%i", ent[i], dis[i]))
  h
}, numeric(1), aaa=aaa, ent=d$ent, dis=d$dis)

...here I also extracted d$ent and d$dis outside the loop which should improve things a bit too. Can't run it myself though since you didn't give reproducible data. But here's a similar example:

d <- data.frame(x=1)
system.time( vapply(1:1e6, function(i) d$x, numeric(1)) )         # 3.20 secs
system.time( vapply(1:1e6, function(i, x) x, numeric(1), x=d$x) ) # 0.56 secs

... but finally it seems it can all be reduced to (barring your error detection code):

d$foo <- aaa[cbind(d$ent, d$dis)]
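If the error detection is still wanted, it can be vectorized as well. The following is only a sketch: the small aaa matrix and d data frame are made-up stand-ins for the question's objects, and bad is a hypothetical helper name.

```r
# Toy stand-ins (assumed, not the question's real data)
aaa <- matrix(c(1, 0, 3, 4, 5, 6), nrow = 2)          # 2 x 3 lookup matrix
d <- data.frame(ent = c(1L, 2L, 1L), dis = c(1L, 1L, 3L))

# Indexing a matrix with a two-column matrix picks one (row, col) element per row
d$foo <- aaa[cbind(d$ent, d$dis)]

# Report every zero lookup in one go instead of checking inside a loop
bad <- which(d$foo == 0)
if (length(bad) > 0) {
  writeLines(sprintf("ERROR. ent:%i dis:%i", d$ent[bad], d$dis[bad]))
}
```

sprintf is vectorized, so one call yields one message per offending row.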
Tommy

Tommy's is the best answer. This was too big for comment so adding it as an answer...

This is how you can see the copies (of the whole of DF, as joran commented) yourself:

> DF = data.frame(a=1:3,b=4:6)
> tracemem(DF)
[1] "<0x0000000003104800"
> for (i in 1:3) {DF$b[i] <- i; .Internal(inspect(DF))}
tracemem[0000000003104800 -> 000000000396EAD8]: 
tracemem[000000000396EAD8 -> 000000000396E4F0]: $<-.data.frame $<- 
tracemem[000000000396E4F0 -> 000000000399CDC8]: $<-.data.frame $<- 
@000000000399CDC8 19 VECSXP g0c2 [OBJ,NAM(2),TR,ATT] (len=2, tl=0)
  @000000000399CD90 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
  @000000000399CCE8 13 INTSXP g0c2 [] (len=3, tl=0) 1,5,6
ATTRIB: # .. snip ..

tracemem[000000000399CDC8 -> 000000000399CC40]: 
tracemem[000000000399CC40 -> 000000000399CAB8]: $<-.data.frame $<- 
tracemem[000000000399CAB8 -> 000000000399C9A0]: $<-.data.frame $<- 
@000000000399C9A0 19 VECSXP g0c2 [OBJ,NAM(2),TR,ATT] (len=2, tl=0)
  @000000000399C968 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
  @000000000399C888 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,6
ATTRIB: # .. snip ..

tracemem[000000000399C9A0 -> 000000000399C7E0]: 
tracemem[000000000399C7E0 -> 000000000399C700]: $<-.data.frame $<- 
tracemem[000000000399C700 -> 00000000039C78D8]: $<-.data.frame $<- 
@00000000039C78D8 19 VECSXP g0c2 [OBJ,NAM(2),TR,ATT] (len=2, tl=0)
  @00000000039C78A0 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
  @0000000003E07890 13 INTSXP g0c2 [] (len=3, tl=0) 1,2,3
ATTRIB: # .. snip ..
> DF
  a b
1 1 1
2 2 2
3 3 3

Each of those tracemem[] lines corresponds to a copy of the whole object. You can see the hex addresses of the a column vector changing each time, too, despite it not being modified by the assignment to b.
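To get a feel for what those copies cost at roughly the size mentioned in the question, here is a rough timing sketch on synthetic data. The 7500 x 18 data frame of zeros and the names t_df and t_vec are assumptions for illustration, and the actual timings depend heavily on the R version and machine.

```r
n <- 7500
# Synthetic data frame, roughly the size mentioned in the question (assumed)
DF <- as.data.frame(matrix(0, nrow = n, ncol = 18))

# Pattern 1: assign element by element into a data frame column
DF$dis300 <- 0
t_df <- system.time(for (i in seq_len(n)) DF$dis300[i] <- i)

# Pattern 2: fill a pre-allocated vector, then attach it once
foo <- numeric(n)
t_vec <- system.time(for (i in seq_len(n)) foo[i] <- i)
DF$foo <- foo

# Elapsed seconds for each pattern
print(c(data_frame = t_df[["elapsed"]], vector = t_vec[["elapsed"]]))
```

Both patterns produce the same column; only the work done on each iteration differs.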

AFAIK, without dropping into C code yourself, the only ways (currently) in R to modify an item of a data.frame with no copy of any memory at all are the := operator and the set() function, both in package data.table. There are 17 questions about assigning by reference using := here on Stack Overflow.
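As a minimal sketch of what that looks like, assuming the data.table package is installed, the loop from the tracemem example could be written with set(), and the same result obtained with := and no loop. DT and DT2 are illustrative names.

```r
library(data.table)

DT <- data.table(a = 1:3, b = 4:6)

# set() updates one cell of column b in place on each iteration
for (i in seq_len(nrow(DT))) {
  set(DT, i = i, j = "b", value = i)
}

# := assigns a whole column by reference, so no loop is needed
DT2 <- data.table(a = 1:3, b = 4:6)
DT2[, b := 1:3]
```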

But in this case Tommy's one liner is definitely best as you don't even need a loop at all.

Matt Dowle