How can I efficiently construct a very long factor with few levels?

Question

In R, I want to create a factor with only a few levels but a length of almost 100 million. The "normal" way for me to create a factor is to call factor() on a character vector, but I expect this to be very inefficient at that size. What is the proper way to construct a long factor without fully expanding the corresponding character vector?

Here is an example of the wrong way to do it: creating and then factoring a character vector:

long.char.vector = sample(c("left", "middle", "right"), replace=TRUE, 50000000)
long.factor = factor(long.char.vector)

How can I construct long.factor without first constructing long.char.vector? Yes, I know those two lines of code can be combined, but the resulting line of code still creates the gigantic char vector anyway.

What level of efficiency do you really need? I was able to run the code above with a length of 100 million in 12 seconds, and 200 million in 25 seconds (2.8 GHz i5 iMac). Sure, it took a bunch of RAM to do that, but as they say: "RAM is cheap. Thinking is expensive." – Noah, Apr 11 '11 at 20:53
I'm processing a large dataset of DNA sequences. For each sequence, I'm either trimming a prefix, trimming a suffix, not trimming anything, or discarding the read entirely, and I want to create a factor with levels `c("left", "right", "all", "none")` to record what action I took for each sequence. – Ryan C. Thompson, Apr 11 '11 at 21:35

score 8 · Accepted Answer · answered Apr 11 '11 at 20:48

It's not going to be much more efficient, but you can sample a factor vector:

big.factor <- sample(factor(c("left", "middle", "right")), replace=TRUE, 5e7)
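As a small-scale sketch of why this works: `sample()` on a non-numeric vector simply indexes it with random positions, so sampling the short factor never builds a long character vector. The names `n`, `short.factor`, `by.sample`, and `by.index` below are illustrative and not part of the original answer, and `n` is kept small so the check runs quickly.

```r
# Illustrative size; the question uses 5e7.
n <- 1e5

# A short factor holding each level exactly once.
short.factor <- factor(c("left", "middle", "right"))

# Approach 1: sample the short factor directly.
set.seed(1)
by.sample <- sample(short.factor, n, replace = TRUE)

# Approach 2: index the short factor with sampled positions.
set.seed(1)
by.index <- short.factor[sample(3, n, replace = TRUE)]

# Both results are factors with the same three levels.
is.factor(by.sample)
levels(by.sample)
identical(by.sample, by.index)
```

With the same seed, the two approaches should draw the same positions, so the final comparison is a quick way to confirm that they are interchangeable.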

answered Apr 11 '11 at 20:48 by Joshua Ulrich

+1 It actually is a whole lot more efficient: about tenfold on my computer. So first make the factor, then sample; very good to know. That's what's going wrong when using gl() as well. – Joris Meys Apr 11 '11 at 21:40
Actually this seems to be much more efficient! More than twice as fast on my computer at least, and only slightly slower than sampling an integer vector and wrapping it with `structure(..., class='factor', levels=...)`. – Charles Apr 11 '11 at 21:42
+1 indeed. A similar version is `factor(c("left", "middle", "right"))[sample(3, 5e7, replace=TRUE)]`, i.e. generate the long factor by taking a short factor and repeatedly indexing into it. – Gavin Simpson Apr 11 '11 at 21:53

score 3 · Answer 2 · answered Apr 11 '11 at 21:46

You could construct the factor from scratch, starting from a vector of integer codes:

long.factor <- sample(seq.int(3), replace=TRUE, 50000000)
levels(long.factor) <- c("left", "middle", "right")
class(long.factor) <- "factor"
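The same construction can be written in one step with `structure()`, as a minimal sketch. The names `n`, `lev`, `codes`, `from.codes`, and `reference` are illustrative; `n` is kept small so that the result can be compared against the ordinary `factor()` route. Note that the codes need to be of integer type (as returned by `sample.int()`), since a factor is an integer vector with `levels` and `class` attributes.

```r
# Illustrative size; the question uses 5e7.
n <- 1e5
lev <- c("left", "middle", "right")

# Integer codes in 1..length(lev); sample.int() returns an integer vector.
set.seed(42)
codes <- sample.int(length(lev), n, replace = TRUE)

# Attach levels and class in a single call, without any character vector.
from.codes <- structure(codes, levels = lev, class = "factor")

# Reference result built the usual way, only feasible here because n is small.
reference <- factor(lev[codes], levels = lev)

identical(from.codes, reference)
```

Passing `levels = lev` explicitly in the reference call keeps the level order fixed rather than relying on alphabetical sorting.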

answered Apr 11 '11 at 21:46 by Marek
