I have an old problem made challenging by the size of the data set. The task is to transform a data frame from long format into a wide matrix:
set.seed(314)
A <- data.frame(field1 = sample(letters, 10, replace = FALSE),
                field2 = sample(toupper(letters), 10, replace = FALSE),
                value = 1:10)
B <- with(A, tapply(value, list(field1, field2), sum))
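With this toy input, B is a 10 x 10 matrix, one row per level of field1 and one column per level of field2, with NA for the pairs that never occur.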
This can also be done with the old reshape in base R or, better, with plyr and reshape2.
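For completeness, the base reshape call would be something like the following (my sketch; reshape pivots without aggregating, so it assumes each (field1, field2) pair is unique, and it returns a data frame rather than a matrix):
reshape(A, idvar = "field1", timevar = "field2", direction = "wide")
In plyr: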
daply(A, .(field1, field2), function(x) sum(x$value))
In reshape2, where acast returns the wide matrix directly (dcast would return a data frame):
acast(A, field1 ~ field2, sum, value.var = "value")
The problem is that the real data frame has 30+ million rows, with at least 5,000 unique values for field1 and 20,000 for field2. At this size, plyr crashes, reshape2 occasionally crashes, and tapply is very slow. The machine is not the constraint (48 GB of RAM at under 50% utilization, 8-core Xeon). What is the best practice for this task?
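For what it is worth, the direction I have been sketching (an assumption on my part, not a benchmarked answer) is to build the matrix directly from the factor codes with Matrix::sparseMatrix, since the values of repeated (i, j) pairs are summed when the triplet form is compressed:
library(Matrix)
i <- factor(A$field1)
j <- factor(A$field2)
# repeated (i, j) pairs have their x values summed during compression
B <- sparseMatrix(i = as.integer(i), j = as.integer(j), x = A$value,
                  dims = c(nlevels(i), nlevels(j)),
                  dimnames = list(levels(i), levels(j)))
B_dense <- as.matrix(B)  # ~5000 x 20000 doubles is roughly 0.8 GB
Note that absent combinations come out as 0 rather than the NA that tapply produces.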
N.B.: This question is not a duplicate: I explicitly need the output to be a wide array. The answer flagged as a duplicate uses dcast.data.table, which returns a data.table, and casting a data.table to an array is a very expensive operation.
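For concreteness, the route from the supposed duplicate would look roughly like this (my reconstruction, assuming data.table 1.9+ for dcast.data.table); the final as.matrix call is exactly the expensive copy I want to avoid:
library(data.table)
DT <- as.data.table(A)
W <- dcast.data.table(DT, field1 ~ field2, fun.aggregate = sum, value.var = "value")
M <- as.matrix(W[, -1, with = FALSE])  # full copy of the wide table
rownames(M) <- W$field1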