4

I have an old problem made challenging by the size of the data set. The problem is to transform a data frame from long to a wide matrix:

set.seed(314)
A <- data.frame(field1 = sample(letters, 10, replace=FALSE), 
    field2 = sample(toupper(letters), 10, replace=FALSE), 
    value=1:10)

B <- with(A, tapply(value, list(field1, field2), sum))

This can also be done with the old reshape() in base R or, better, with plyr and reshape2. In plyr:

daply(A, .(field1, field2), function(x) sum(x$value))

In reshape2:

dcast(A, field1 ~ field2, sum, value.var = "value")

The problem is that the data frame has 30+ million rows, with at least 5,000 unique values for field1 and 20,000 for field2. At this size, plyr crashes, reshape2 occasionally crashes, and tapply is very slow. The machine is not the constraint (48 GB RAM, under 50% utilization, 8-core Xeon). What is the best practice for this task?

N.B.: This question is not a duplicate. I explicitly mention that the output should be a wide array. The answer referenced as a duplicate uses dcast.data.table, which returns a data.table, and casting a data.table to an array is a very expensive operation.
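For reference, the route I am trying to avoid looks roughly like this (a sketch only; it assumes A is converted to a data.table first, and the intermediate variable names are just for illustration). The dcast.data.table call itself is fast; the final conversion to a plain matrix is the expensive step:

require(data.table)
DT <- as.data.table(A)
wide <- dcast.data.table(DT, field1 ~ field2, fun.aggregate = sum, value.var = "value")
B <- as.matrix(wide[, -1, with = FALSE])   # the costly data.table -> matrix conversion
rownames(B) <- wide$field1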

gappy
  • Have you seen: http://stackoverflow.com/questions/18802257/efficient-ways-to-reshape-huge-data-from-long-to-wide-format-similar-to-dcast or http://stackoverflow.com/questions/6902087/proper-fastest-way-to-reshape-a-data-table ? – thelatemail May 16 '14 at 02:43
  • Hope my "gold-hammer" duplicate marking was not premature.... – mnel May 16 '14 at 03:07
  • +mnel, see comment above. This is not a duplicate question. – gappy May 16 '14 at 20:12

2 Answers

3

FWIW, here's a solution that uses data.table only for the aggregation step.

(Edit: I replaced the dcast.data.table answer with @BenBolker's answer, as it avoids that step completely and is efficient in both memory and speed. Please check the revision history if you're looking for that solution.)

Create some sample data (with similar specifications):

require(data.table) ## >= 1.9.2
set.seed(1L)
N = 30e6L
DT <- data.table(field1 = sample(paste0("F1_", 1:5000), N, TRUE), 
                 field2 = sample(paste0("F2_", 1:20000), N, TRUE),
                 value  = sample(10))   # length-10 vector, recycled to N rows

> tables()
#      NAME       NROW  MB COLS                KEY
# [1,] DT   30,000,000 574 field1,field2,value
# Total: 574MB

Aggregate:

system.time(ans <- DT[, list(value=sum(value)), by=list(field1, field2)])
#   user  system elapsed
# 15.097   3.357  18.454

(The edited answer:) You can then use @BenBolker's (clever) solution as follows, which avoids the need for casting entirely:

system.time({
    rlabs <- sort(unique(ans$field1))
    clabs <- sort(unique(ans$field2))
    fans <- matrix(NA,length(rlabs),length(clabs),
              dimnames=list(rlabs,clabs))
    fans[as.matrix(ans[,1:2, with=FALSE])] <- ans$value
})
#   user  system elapsed
# 18.630   1.524  20.154
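
A quick sanity check on the result (a sketch; the specific values depend on the simulated data above):

dim(fans)                                  # 5000 x 20000
fans["F1_1", "F2_1"]                       # aggregated sum for that pair, or NA if it never occurred
ans[field1 == "F1_1" & field2 == "F2_1"]   # the corresponding aggregated row, for comparison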
Arun
  • Thanks, I will profile. Currently I do a dcast.data.table followed by a data.matrix (and some minor transformations). This should be faster. – gappy May 17 '14 at 03:08
2

I haven't checked the speed, but is this low-level approach any better? (Set up a matrix full of NAs with the appropriate margins and fill in using 2-column matrix indexing ...)

rlabs <- sort(unique(A$field1))
clabs <- sort(unique(A$field2))
B <- matrix(NA,length(rlabs),length(clabs),
      dimnames=list(rlabs,clabs))
B[as.matrix(A[,1:2])] <- A[,3]

It would be really nice if you could set this up as a sparse matrix, but I assume that you have zero values in your value column ...
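
If a stored zero could safely stand in for "no observation", a sparse version might look roughly like this (a sketch using the Matrix package; repeated (field1, field2) pairs are summed, which matches the sum aggregation above):

library(Matrix)
i <- factor(A$field1)
j <- factor(A$field2)
Bs <- sparseMatrix(i = as.integer(i), j = as.integer(j), x = A$value,
                   dims = c(nlevels(i), nlevels(j)),
                   dimnames = list(levels(i), levels(j)))

Whether this actually saves memory depends on how densely the full field1 x field2 grid is populated.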

Ben Bolker