In R, I want to create a factor with only a few levels but a length of almost 100 million. The "normal" way for me to create a factor is to call `factor` on a character vector, but I expect this method to be very inefficient. What is the proper way to construct a long factor without fully expanding the corresponding character vector?

Here is an example of the wrong way to do it, creating a character vector and then converting it to a factor:

long.char.vector = sample(c("left", "middle", "right"), replace=TRUE, 50000000)
long.factor = factor(long.char.vector)

How can I construct long.factor without first constructing long.char.vector? Yes, I know those two lines of code can be combined, but the resulting line of code still creates the gigantic char vector anyway.

Ryan C. Thompson
  • What level of efficiency do you really need? I was able to run the code above with a length of 100 million in 12 seconds, and 200 million in 25 seconds (2.8 GHz i5 iMac). Sure, it took a bunch of RAM to do that, but as they say: "RAM is cheap. Thinking is expensive." – Noah Apr 11 '11 at 20:53
  • What do you need the factor for? – John Apr 11 '11 at 21:07
  • I'm processing a large dataset of DNA sequences. For each sequence, I'm either trimming a prefix, trimming a suffix, not trimming anything, or discarding the read entirely, and I want to create a factor with levels `c("left", "right", "all", "none")` to record what action I took for each sequence. – Ryan C. Thompson Apr 11 '11 at 21:35

2 Answers


It's not going to be much more efficient, but you can sample a factor vector:

big.factor <- sample(factor(c("left", "middle", "right")), replace=TRUE, 5e7)
Joshua Ulrich
  • +1 It actually is a whole lot more efficient: about tenfold on my computer. So first make the factor, then sample; very good to know. That's what's going wrong when using gl() as well. – Joris Meys Apr 11 '11 at 21:40
  • Actually this seems to be much more efficient! More than twice as fast on my computer at least, and only slightly slower than sampling an integer vector and wrapping it with `structure(..., class='factor', levels=...)`. – Charles Apr 11 '11 at 21:42
  • +1 indeed. A similar version is `factor(c("left", "middle", "right"))[sample(3, 5e7, replace=TRUE)]`, i.e. generate the long factor by taking a short factor and repeatedly indexing into it. – Gavin Simpson Apr 11 '11 at 21:53
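
The two variants mentioned in the comments above (indexing into a short factor, and wrapping sampled integer codes with `structure()`) can be sketched roughly as follows. This is only an illustration: `n` is deliberately small so it runs quickly, and the names `lv`, `short.factor`, `by.index`, `codes`, and `by.structure` are made up for this example.

```r
# Illustrative size only (an assumption); the question uses 5e7.
n <- 1e5
lv <- c("left", "middle", "right")

# Variant 1: build a short factor once, then index into it
# with randomly sampled positions.
short.factor <- factor(lv)
by.index <- short.factor[sample(length(lv), n, replace = TRUE)]

# Variant 2: sample integer codes and attach the factor
# attributes directly, never creating a character vector.
codes <- sample(length(lv), n, replace = TRUE)
by.structure <- structure(codes, levels = lv, class = "factor")
```

In both cases only integer codes of length `n` are stored; the level labels exist once, in the `levels` attribute.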

You could construct the factor from scratch:

long.factor <- sample(seq.int(3), replace=TRUE, 50000000)
levels(long.factor) <- c("left", "middle", "right")
class(long.factor) <- "factor"
Marek
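
For the use case described in the question's comments (recording one of four actions per sequence), the same from-scratch idea might look like the sketch below: preallocate integer codes, fill them in as the reads are processed, and attach the levels at the end. The names `action.levels`, `codes`, and `actions` are hypothetical, and the cyclic assignment is only a stand-in for the real trimming logic.

```r
action.levels <- c("left", "right", "all", "none")
n <- 20L                  # illustrative only; the real data has far more reads
codes <- integer(n)       # one integer code per read, preallocated

# Hypothetical stand-in for the real per-read decision:
# simply cycle through the four actions.
codes[] <- rep_len(seq_along(action.levels), n)

# Attach levels and class to turn the codes into a factor
# without ever building a character vector.
actions <- structure(codes, levels = action.levels, class = "factor")
table(actions)
```

Unlike `factor()`, which sorts the levels by default, this keeps the levels in exactly the order given, so code `1` always means `"left"`.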