
Recoding is a common practice for survey data, but the most obvious routes take more time than they should.

The fastest code that accomplishes the same task on the provided sample data, as measured by system.time() on my machine, wins.

## Sample data
dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1,2,4,5,3),50000))
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat)
dat <- as.data.frame(dat)
re.codes <- c("This","That","And","The","Other")

Code to optimize:

for (x in 1:ncol(dat)) {
  dat[, x] <- factor(dat[, x], labels = re.codes)
}

Current system.time():

   user  system elapsed 
   4.40    0.10    4.49 

Hint: dat <- lapply(1:ncol(dat), function(x) factor(dat[,x], labels=re.codes)) is not any faster.

Matt Dowle
Brandon Bertelsen
    +1 Brandon, this is a brilliant question. I have observed the same problem with my survey data, with some tasks taking 11 seconds, on occasion. Thank you. – Andrie May 27 '11 at 05:42
  • I'm not going to lie, it's a bit of a self-serving challenge but a fun game nevertheless! – Brandon Bertelsen May 27 '11 at 05:45
  • @Andrie, ps: your website is broken :) – Brandon Bertelsen May 27 '11 at 05:50
  • Brandon, yes I know. It's been broken for about 24 hours, but I had another emergency to sort out first. I almost had a heart attack sorting out a live survey that went dramatically wrong. But thanks for the heads up. – Andrie May 27 '11 at 08:26
  • Not a big fan of these micro-optimisation questions. Speed comes a distant third after correctness and maintainability – hadley May 27 '11 at 18:46
  • @hadley: Although speed isn't a concern for you, it's likely a concern for @Brandon, else he wouldn't have asked the question. It's his decision whether to trade readability / maintainability for speed. Perhaps speed is a close second to correctness for him. – Joshua Ulrich May 27 '11 at 19:12
  • @hadley I understand your point. Personally, I like these questions because they tease out the collective wisdom of the community. I never fail to learn something from the answers. In any case, there's a bunch more planned - so you can feel free to downvote those too :) – Brandon Bertelsen May 27 '11 at 20:26
  • In my mind there's a big difference between fast enough and fast. I think sacrificing maintainability for speed is perilous. I've always regretted it when I come back to code I wrote for speed and then have no idea how it works. – hadley May 27 '11 at 22:24
  • I'm just trying to share my experiences. No judgement attached. – hadley May 27 '11 at 22:25
  • @hadley, no judgement received nor given on my part. I was really hoping the smiley face would convey that. – Brandon Bertelsen May 27 '11 at 22:29
  • I agree with Hadley that correctness and maintainability should come before speed; however, I too find questions like this helpful, because I often discover ways to improve all three by learning a new function, seeing a more R-ish way of doing something, and the like. – Aaron left Stack Overflow May 29 '11 at 12:41

6 Answers


My computer is obviously much slower, but structure is a pretty fast way to do this:

> system.time({
+ dat1 <- dat
+ for(x in 1:ncol(dat)) {
+   dat1[,x] <- factor(dat1[,x], labels=re.codes)
+   }
+ })
   user  system elapsed 
 11.965   3.172  15.164 
> 
> system.time({
+ m <- as.matrix(dat)
+ dat2 <- data.frame( matrix( re.codes[m], nrow = nrow(m)))
+ })
   user  system elapsed 
  2.100   0.516   2.621 
> 
> system.time(dat3 <- data.frame(lapply(dat, structure, class='factor', levels=re.codes)))
   user  system elapsed 
  0.484   0.332   0.820 

# this isn't TRUE because the levels get re-ordered
> all.equal(dat1, dat2)

> all.equal(dat1, dat3)
[1] TRUE
Charles

Combining @DWin's answer and my answer from "Most efficient list to data.frame method?":

system.time({
  dat3 <- list()
  # define attributes once outside of loop
  attrib <- list(class="factor", levels=re.codes)
  for (i in names(dat)) {              # loop over each column in 'dat'
    dat3[[i]] <- as.integer(dat[[i]])  # convert column to integer
    attributes(dat3[[i]]) <- attrib    # assign factor attributes
  }
  # convert 'dat3' into a data.frame. We can do it like this because:
  # 1) we know 'dat' and 'dat3' have the same number of rows and columns
  # 2) we want 'dat3' to have the same colnames as 'dat'
  # 3) we don't care if 'dat3' has different rownames than 'dat'
  attributes(dat3) <- list(row.names=c(NA_integer_,nrow(dat)),
    class="data.frame", names=names(dat))
})
identical(dat2, dat3)  # 'dat2' is from @DWin's answer
Joshua Ulrich
  • +1 system.time() = 0.08. What just happened? I'd really appreciate a detailed explanation of this one Josh. – Brandon Bertelsen May 27 '11 at 14:34
  • The only work it does is convert the data to integers; beyond that, all it does is add attributes (to each column and to the whole) to make the columns into factors and the whole into a data.frame. It's faster because it skips the usual checks that ensure the resulting factors and data.frame are sensible. – Aaron left Stack Overflow May 27 '11 at 15:03
  • @Brandon: @Aaron is spot-on. The `as.integer` call is slightly faster than @DWin's `storage.mode` approach. The rest of the gains come from skipping all the checks, which assumes the original `dat` is a sensible data.frame. – Joshua Ulrich May 27 '11 at 15:09
  • Could you be more explicit in what you mean by "sensible"? – Brandon Bertelsen May 27 '11 at 15:14
  • Rather than take my word for it, see the first two paragraphs in the Details section of `?data.frame` (basically: columns have the same number of rows, row names are unique, column names exist and are unique. I assume unique column names even though data.frames are not *required* to have them). – Joshua Ulrich May 27 '11 at 15:19

Try this:

m <- as.matrix(dat)

dat <- data.frame( matrix( re.codes[m], nrow = nrow(m)))
Prasad Chalasani

A data.table answer for your consideration. We're just using setattr() from it, which works on a data.frame and on columns of a data.frame; there's no need to convert to data.table.

The test data again:

dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1L,2L,4L,5L,3L),50000)) 
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat) 
dat <- as.data.frame(dat) 
re.codes <- c("This","That","And","The","Other") 

Now change the class and set the levels of each column directly, by reference:

require(data.table)
system.time(for (i in 1:ncol(dat)) {
  setattr(dat[[i]],"levels",re.codes)
  setattr(dat[[i]],"class","factor")
}
# user  system elapsed 
#   0       0       0 

identical(dat, <result in question>)
# [1] TRUE

Does 0.00 win? As you increase the size of the data, this method stays at 0.00.

Ok, I admit, I changed the input data slightly to be integer for all columns (the question has double input data in a third of the columns). Those double columns have to be converted to integer, because factor is only valid on integer vectors, as mentioned in the other answers.

So, strictly with the input data in the question, and including the double-to-integer conversion:

dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1,2,4,5,3),50000))             
dat <- cbind(dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat,dat)               
dat <- as.data.frame(dat)               
re.codes <- c("This","That","And","The","Other")           

system.time(for (i in 1:ncol(dat)) {
  if (!is.integer(dat[[i]]))
      set(dat,j=i,value=as.integer(dat[[i]]))
  setattr(dat[[i]],"levels",re.codes)
  setattr(dat[[i]],"class","factor")
})
#  user  system elapsed
#  0.06    0.01    0.08      # on my slow netbook

identical(dat, <result in question>)
# [1] TRUE

Note that set works on data.frame too; you don't have to convert to data.table to use it.

These are very small times, clearly, since it's only a small input dataset:

dim(dat)
# [1] 250000     36 
object.size(dat)
# 68.7 Mb

Scaling up from this should reveal larger differences, but even so I think it should be (just about) measurably the fastest. Not a difference anyone minds at this size, though.

The setattr function is also in the bit package, btw, so the 0.00 method can be done with either data.table or bit. To do the type conversion by reference (if required), either set or := (both in data.table) is needed, afaik.

Matt Dowle
  • +1 very cool. I guess technically, all that's really happening here is that you're storing a few values of text for each column. Hence the speed. I wonder if this could be done without the use of data.table or bit packages, just by setting attributes. – Brandon Bertelsen Sep 25 '12 at 15:49
  • @BrandonBertelsen Exactly. It's just changing the attributes by reference. Afaik, no, not possible without either data.table or bit. The reason, I think, is that it breaks standard R practice. Say you have `dat2<-dat` beforehand. `setattr` will change both `dat2` and `dat`. Base R methods all copy at least some of the memory, at least once, and sometimes all of it many times, to uphold copy-on-write. Even when there is only one `dat` and there is no need to copy it at all. `setattr` and `set`, when used on a `data.frame`, could be considered _dangerous_ by some, for this reason. – Matt Dowle Sep 25 '12 at 16:10
  • @BrandonBertelsen `setattr` is nothing other than a wrapper to R's `setAttrib` function at C level. So you can `.Call` (or similar) to that, too, directly yourself, but not possible in base R (afaik) is what I meant. Copies were reduced in recent versions of R, but not down to zero. – Matt Dowle Sep 25 '12 at 16:23
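Matt's point about reference semantics can be sketched as follows (a minimal example, assuming data.table is installed). Note that the plain copy `y` is modified as well, which is exactly the behavior he flags as potentially dangerous:

```r
library(data.table)
x <- c(1L, 2L, 1L)
y <- x                        # no copy is made yet (R's copy-on-write)
setattr(x, "levels", c("lo", "hi"))
setattr(x, "class", "factor") # modifies x in place, bypassing copy-on-write
class(y)                      # "factor" -- y shares x's memory and changed too
```

This is why setattr is fast: it writes the attributes directly onto the existing vector instead of copying it first.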

The help page for class() says that class<- is deprecated and to use the as.* coercion methods instead. I haven't quite figured out why the earlier effort reported 0 observations when the data was obviously in the object, but this method results in a complete object:

system.time({
  dat2 <- vector(mode = "list", length(dat))
  for (i in 1:length(dat)) {
    dat2[[i]] <- dat[[i]]
    storage.mode(dat2[[i]]) <- "integer"
    attributes(dat2[[i]]) <- list(class = "factor", levels = re.codes)
  }
  names(dat2) <- names(dat)
  dat2 <- as.data.frame(dat2)
})
#--------------------------  
  user  system elapsed 
  0.266   0.290   0.560 
> str(dat2)
'data.frame':   250000 obs. of  36 variables:
 $ V1 : Factor w/ 5 levels "This","That",..: 1 2 3 4 5 1 2 3 4 5 ...
 $ V2 : Factor w/ 5 levels "This","That",..: 5 4 3 2 1 5 4 3 2 1 ...
 $ V3 : Factor w/ 5 levels "This","That",..: 1 2 4 5 3 1 2 4 5 3 ...
 $ V4 : Factor w/ 5 levels "This","That",..: 1 2 3 4 5 1 2 3 4 5 ...
 $ V5 : Factor w/ 5 levels "This","That",..: 5 4 3 2 1 5 4 3 2 1 ...
 $ V6 : Factor w/ 5 levels "This","That",..: 1 2 4 5 3 1 2 4 5 3 ...
 $ V7 : Factor w/ 5 levels "This","That",..: 1 2 3 4 5 1 2 3 4 5 ...
 $ V8 : Factor w/ 5 levels "This","That",..: 5 4 3 2 1 5 4 3 2 1 ...
 snipped

All 36 columns are there.

IRTFM
  • I'd love to know why this is so much faster than Charles's solution (lapply + structure), at least on my machine (1.055sec vs 0.34sec). – joran May 27 '11 at 05:11
  • This returns an empty data frame for me. It does do it quickly though! – Brandon Bertelsen May 27 '11 at 05:21
  • @Brandon Please check your results. I just reran it and posted results. – IRTFM May 27 '11 at 11:50
  • @DWin From your answer: `'data.frame': 0 obs. of 36 variables`. It's empty. `class(dat2) <- "data.frame"` is causing this. With `dat2<-as.data.frame(dat2)` it works (slower, but still faster than Charles'). – Marek May 27 '11 at 12:36
  • @joran: I think the reason it's faster is that I am not reprocessing the numeric vectors. I'm just working "outside" them on their attributes. Now that I think about it, I wonder if I can just do that on the original and skip the copying step? – IRTFM May 27 '11 at 13:08
  • @DWin, @joran As far as I can see, there is no difference in timings when using: `system.time(as.data.frame(lapply(dat, structure, class='factor', levels=re.codes)))` – Marek May 27 '11 at 13:31
  • +1 system.time() = 0.4 seconds I have no idea what you did here but it's hella fast (and now works as expected)! – Brandon Bertelsen May 27 '11 at 14:28

Making factors is expensive; doing it only once is comparable with the commands using structure and, in my opinion, preferable, as you don't have to depend on how factors happen to be constructed internally.

rc <- factor(re.codes, levels=re.codes)
dat5 <- as.data.frame(lapply(dat, function(d) rc[d]))

EDIT 2: Interestingly, this seems to be a case where lapply does speed things up. This for loop is substantially slower.

for(i in seq_along(dat)) {
  dat[[i]] <- rc[dat[[i]]]
}

EDIT 1: You can also speed things up by being more precise with your types. Try any of the solutions (but especially your original one) after creating your data as integers, as follows. For details, see a previous answer of mine here.

dat <- cbind(rep(1:5,50000),rep(5:1,50000),rep(c(1L,2L,4L,5L,3L),50000))

This is also a good idea because converting to integers from floating point, as is being done in all of the faster solutions here, can give unexpected behavior; see this question.
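The kind of surprise referred to here comes from floating-point representation: as.integer truncates toward zero, so a double that is stored as 0.999... instead of 1 would recode to the wrong category. A small illustration (hypothetical values, not from the question's data):

```r
x <- 0.3 * 3 + 0.1   # mathematically 1, but stored as 0.9999999999999999
x == 1               # FALSE
as.integer(x)        # 0, not 1 -- a recode keyed on this would pick the wrong level
as.integer(round(x)) # 1 -- rounding first avoids the truncation surprise
```

Starting from integer data (the `1L` literals above) sidesteps the problem entirely.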

Aaron left Stack Overflow
  • Type modification in the original loop to `for(x in 1:ncol(dat)) dat[,x] <- factor(as.integer(dat[,x]), labels=re.codes)` speeds up execution significantly. – Marek May 27 '11 at 15:49
  • Nice suggestion, Marek, and good clarification that the speedup comes much more from running `factor` on an integer rather than from starting with integers. – Aaron left Stack Overflow May 27 '11 at 15:57