
Recoding is a common practice for survey data, but the most obvious routes take more time than they should.

The fastest code that accomplishes the same task on the provided sample data, as measured by system.time() on my machine, wins.

## Sample data
dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1,2,4,5,3),50000))
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat)
dat <- as.data.frame(dat)
re.codes <- c("This","That","And","The","Other")

Code to optimize:

for (x in 1:ncol(dat)) {
  dat[, x] <- factor(dat[, x], labels = re.codes)
}

Current system.time():

   user  system elapsed 
   4.40    0.10    4.49 

Hint: dat <- lapply(1:ncol(dat), function(x) factor(dat[,x], labels=re.codes)) is not any faster.

Matt Dowle
Brandon Bertelsen
    +1 Brandon, this is a brilliant question. I have observed the same problem with my survey data, with some tasks taking 11 seconds, on occasion. Thank you. – Andrie May 27 '11 at 05:42
  • I'm not going to lie, it's a bit of a self-serving challenge but a fun game nevertheless! – Brandon Bertelsen May 27 '11 at 05:45
  • @Andrie, ps: your website is broken :) – Brandon Bertelsen May 27 '11 at 05:50
  • Brandon, yes I know. It's been broken for about 24 hours, but I had another emergency to sort out first. I almost had a heart attack sorting out a live survey that went dramatically wrong. But thanks for the heads up. – Andrie May 27 '11 at 08:26
  • Not a big fan of these micro-optimisation questions. Speed comes a distant third after correctness and maintainability – hadley May 27 '11 at 18:46
  • @hadley: Although speed isn't a concern for you, it's likely a concern for @Brandon, else he wouldn't have asked the question. It's his decision whether to trade readability / maintainability for speed. Perhaps speed is a close second to correctness for him. – Joshua Ulrich May 27 '11 at 19:12
  • @hadley I understand your point. Personally, I like these questions because they tease out the collective wisdom of the community. I never fail to learn something from the answers. In any case, there's a bunch more planned - so you can feel free to downvote those too :) – Brandon Bertelsen May 27 '11 at 20:26
  • In my mind there's a big difference between fast enough and fast. I think sacrificing maintainability for speed is perilous. I've always regretted it when I come back to code I wrote for speed and then have no idea how it works. – hadley May 27 '11 at 22:24
  • I'm just trying to share my experiences. No judgement attached. – hadley May 27 '11 at 22:25
  • @hadley, no judgement received nor given on my part. I was really hoping the smiley face would convey that. – Brandon Bertelsen May 27 '11 at 22:29
  • I agree with Hadley that correctness and maintainability should come before speed; however, I too find questions like this helpful, because I often discover ways to improve all three by learning a new function, seeing a more R-ish way of doing something, and the like. – Aaron left Stack Overflow May 29 '11 at 12:41

6 Answers


My computer is obviously much slower, but structure is a pretty fast way to do this:

> system.time({
+ dat1 <- dat
+ for(x in 1:ncol(dat)) {
+   dat1[,x] <- factor(dat1[,x], labels=re.codes)
+   }
+ })
   user  system elapsed 
 11.965   3.172  15.164 
> 
> system.time({
+ m <- as.matrix(dat)
+ dat2 <- data.frame( matrix( re.codes[m], nrow = nrow(m)))
+ })
   user  system elapsed 
  2.100   0.516   2.621 
> 
> system.time(dat3 <- data.frame(lapply(dat, structure, class='factor', levels=re.codes)))
   user  system elapsed 
  0.484   0.332   0.820 

# this isn't TRUE because the levels get re-ordered
> all.equal(dat1, dat2)

> all.equal(dat1, dat3)
[1] TRUE
Charles

Combining @DWin's answer and my answer from "Most efficient list to data.frame method?":

system.time({
  dat3 <- list()
  # define attributes once outside of loop
  attrib <- list(class="factor", levels=re.codes)
  for (i in names(dat)) {              # loop over each column in 'dat'
    dat3[[i]] <- as.integer(dat[[i]])  # convert column to integer
    attributes(dat3[[i]]) <- attrib    # assign factor attributes
  }
  # convert 'dat3' into a data.frame. We can do it like this because:
  # 1) we know 'dat' and 'dat3' have the same number of rows and columns
  # 2) we want 'dat3' to have the same colnames as 'dat'
  # 3) we don't care if 'dat3' has different rownames than 'dat'
  attributes(dat3) <- list(row.names=c(NA_integer_,nrow(dat)),
    class="data.frame", names=names(dat))
})
identical(dat2, dat3)  # 'dat2' is from @DWin's answer
Joshua Ulrich
  • +1 system.time() = 0.08. What just happened? I'd really appreciate a detailed explanation of this one Josh. – Brandon Bertelsen May 27 '11 at 14:34
  • The only work it does is convert the data to integers; beyond that, all it does is add attributes (to each column and to the whole) to make the columns into factors and the whole into a data.frame. It's faster because it skips the usual checks that ensure the resulting factors and data.frame are sensible. – Aaron left Stack Overflow May 27 '11 at 15:03
  • @Brandon: @Aaron is spot-on. The `as.integer` call is slightly faster than @DWin's `storage.mode` approach. The rest of the gains come from skipping all the checks, which assumes the original `dat` is a sensible data.frame. – Joshua Ulrich May 27 '11 at 15:09
  • Could you be more explicit in what you mean by "sensible"? – Brandon Bertelsen May 27 '11 at 15:14
  • Rather than take my word for it, see the first two paragraphs in the Details section of `?data.frame` (basically: columns have the same number of rows, row names are unique, column names exist and are unique. I assume unique column names even though data.frames are not *required* to have them). – Joshua Ulrich May 27 '11 at 15:19

Try this:

m <- as.matrix(dat)

dat <- data.frame( matrix( re.codes[m], nrow = nrow(m)))
Prasad Chalasani

A data.table answer for your consideration. We're just using setattr() from it, which works on a data.frame and on columns of a data.frame; there's no need to convert to data.table.

The test data again:

dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1L,2L,4L,5L,3L),50000)) 
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat) 
dat <- as.data.frame(dat) 
re.codes <- c("This","That","And","The","Other") 

Now change the class and set the levels of each column directly, by reference:

require(data.table)
system.time(for (i in 1:ncol(dat)) {
  setattr(dat[[i]],"levels",re.codes)
  setattr(dat[[i]],"class","factor")
}
# user  system elapsed 
#   0       0       0 

identical(dat, <result in question>)
# [1] TRUE

Does 0.00 win? As you increase the size of the data, this method stays at 0.00.

Ok, I admit, I changed the input data slightly to be integer for all columns (the question has double input data in a third of the columns). Those double columns have to be converted to integer, because factor is only valid on integer vectors, as mentioned in the other answers.

So, strictly with the input data in the question, and including the double-to-integer conversion:

dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1,2,4,5,3),50000))             
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat)               
dat <- as.data.frame(dat)               
re.codes <- c("This","That","And","The","Other")           

system.time(for (i in 1:ncol(dat)) {
  if (!is.integer(dat[[i]]))
      set(dat,j=i,value=as.integer(dat[[i]]))
  setattr(dat[[i]],"levels",re.codes)
  setattr(dat[[i]],"class","factor")
})
#  user  system elapsed
#  0.06    0.01    0.08      # on my slow netbook

identical(dat, <result in question>)
# [1] TRUE

Note that set works on data.frame too; you don't have to convert to data.table to use it.

These are very small times, clearly, since it's only a small input dataset:

dim(dat)
# [1] 250000     36 
object.size(dat)
# 68.7 Mb

Scaling up from this should reveal larger differences, but even so I think it should be (just about) measurably the fastest. Not a difference anyone minds at this size, though.

The setattr function is also in the bit package, btw, so the 0.00 method can be done with either data.table or bit. To do the type conversion by reference (if required), either set or := (both in data.table) is needed, afaik.

Matt Dowle
  • +1 very cool. I guess technically, all that's really happening here is that you're storing a few values of text for each column. Hence the speed. I wonder if this could be done without the use of data.table or bit packages, just by setting attributes. – Brandon Bertelsen Sep 25 '12 at 15:49
  • @BrandonBertelsen Exactly. It's just changing the attributes by reference. Afaik, no, not possible without either data.table or bit. The reason, I think, is that it breaks standard R practice. Say you have `dat2<-dat` beforehand. `setattr` will change both `dat2` and `dat`. Base R methods all copy at least some of the memory, at least once, and sometimes all of it many times, to uphold copy-on-write. Even when there is only one `dat` and there is no need to copy it at all. `setattr` and `set`, when used on a `data.frame`, could be considered _dangerous_ by some, for this reason. – Matt Dowle Sep 25 '12 at 16:10
  • @BrandonBertelsen `setattr` is nothing other than a wrapper to R's `setAttrib` function at C level. So you can `.Call` (or similar) to that, too, directly yourself, but not possible in base R (afaik) is what I meant. Copies were reduced in recent versions of R, but not down to zero. – Matt Dowle Sep 25 '12 at 16:23
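Matt's point about reference semantics can be sketched as follows (a minimal example, assuming data.table is installed). Note that the plain copy `y` is modified as well, which is exactly the behavior he flags as potentially dangerous:

```r
library(data.table)
x <- c(1L, 2L, 1L)
y <- x                        # no copy is made yet (R's copy-on-write)
setattr(x, "levels", c("lo", "hi"))
setattr(x, "class", "factor") # modifies x in place, bypassing copy-on-write
class(y)                      # "factor" -- y shares x's memory and changed too
```

This is why setattr is fast: it writes the attributes directly onto the existing vector instead of copying it first.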

The help page for class() says that class<- is deprecated and to use the as.* coercion methods instead. I haven't quite figured out why the earlier effort reported 0 observations when the data was obviously in the object, but this method results in a complete object:

system.time({
  dat2 <- vector(mode = "list", length(dat))
  for (i in 1:length(dat)) {
    dat2[[i]] <- dat[[i]]
    storage.mode(dat2[[i]]) <- "integer"
    attributes(dat2[[i]]) <- list(class = "factor", levels = re.codes)
  }
  names(dat2) <- names(dat)
  dat2 <- as.data.frame(dat2)
})
#--------------------------  
  user  system elapsed 
  0.266   0.290   0.560 
> str(dat2)
'data.frame':   250000 obs. of  36 variables:
 $ V1 : Factor w/ 5 levels "This","That",..: 1 2 3 4 5 1 2 3 4 5 ...
 $ V2 : Factor w/ 5 levels "This","That",..: 5 4 3 2 1 5 4 3 2 1 ...
 $ V3 : Factor w/ 5 levels "This","That",..: 1 2 4 5 3 1 2 4 5 3 ...
 $ V4 : Factor w/ 5 levels "This","That",..: 1 2 3 4 5 1 2 3 4 5 ...
 $ V5 : Factor w/ 5 levels "This","That",..: 5 4 3 2 1 5 4 3 2 1 ...
 $ V6 : Factor w/ 5 levels "This","That",..: 1 2 4 5 3 1 2 4 5 3 ...
 $ V7 : Factor w/ 5 levels "This","That",..: 1 2 3 4 5 1 2 3 4 5 ...
 $ V8 : Factor w/ 5 levels "This","That",..: 5 4 3 2 1 5 4 3 2 1 ...
 snipped

All 36 columns are there.

IRTFM
  • I'd love to know why this is so much faster than Charles's solution (lapply + structure), at least on my machine (1.055sec vs 0.34sec). – joran May 27 '11 at 05:11
  • This returns an empty data frame for me. It does do it quickly though! – Brandon Bertelsen May 27 '11 at 05:21
  • @Brandon Please check your results. I just reran it and posted results. – IRTFM May 27 '11 at 11:50
  • @DWin From your answer: `'data.frame': 0 obs. of 36 variables`. It's empty. `class(dat2) <- "data.frame"` is causing this. With `dat2<-as.data.frame(dat2)` it works (slower, but still faster than Charles'). – Marek May 27 '11 at 12:36
  • @joran: I think the reason it's faster is that I am not reprocessing the numeric vectors. I'm just working "outside" them on their attributes. Now that I think about it, I wonder if I can just do that on the original and skip the copying step? – IRTFM May 27 '11 at 13:08
  • @DWin, @joran As far as I can see, there is no difference in timings when using: `system.time(as.data.frame(lapply(dat, structure, class='factor', levels=re.codes)))` – Marek May 27 '11 at 13:31
  • +1 system.time() = 0.4 seconds I have no idea what you did here but it's hella fast (and now works as expected)! – Brandon Bertelsen May 27 '11 at 14:28

Making factors is expensive; doing it only once is comparable with the commands using structure and, in my opinion, preferable, as you don't have to depend on how factors happen to be constructed internally.

rc <- factor(re.codes, levels=re.codes)
dat5 <- as.data.frame(lapply(dat, function(d) rc[d]))

EDIT 2: Interestingly, this seems to be a case where lapply does speed things up. This for loop is substantially slower.

for(i in seq_along(dat)) {
  dat[[i]] <- rc[dat[[i]]]
}

EDIT 1: You can also speed things up by being more precise with your types. Try any of the solutions (but especially your original one) after creating your data as integers, as follows. For details, see a previous answer of mine here.

dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1L,2L,4L,5L,3L),50000))

This is also a good idea because converting to integers from floating point, as is being done in all of the faster solutions here, can give unexpected behavior; see this question.
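The kind of surprise referred to here comes from floating-point representation: as.integer truncates toward zero, so a double that is stored as 0.999... instead of 1 would recode to the wrong category. A small illustration (hypothetical values, not from the question's data):

```r
x <- 0.3 * 3 + 0.1   # mathematically 1, but stored as 0.9999999999999999
x == 1               # FALSE
as.integer(x)        # 0, not 1 -- a recode keyed on this would pick the wrong level
as.integer(round(x)) # 1 -- rounding first avoids the truncation surprise
```

Starting from integer data (the `1L` literals above) sidesteps the problem entirely.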

Aaron left Stack Overflow
  • Type modification in the original loop to `for(x in 1:ncol(dat)) dat[,x] <- factor(as.integer(dat[,x]), labels=re.codes)` speeds up execution significantly. – Marek May 27 '11 at 15:49
  • Nice suggestion, Marek, and good clarification that the speedup comes much more from running `factor` on an integer rather than from starting with integers. – Aaron left Stack Overflow May 27 '11 at 15:57