
I have a large data table in R:

library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
  ID=sample(1:200000, n, replace=TRUE), 
  Month=sample(1:12, n, replace=TRUE),
  Category=sample(1:1000, n, replace=TRUE),
  Qty=runif(n)*500,
  key=c('ID', 'Month')
)
dim(DT)

I'd like to pivot this data.table so that each value of Category becomes its own column. Unfortunately, since the number of categories isn't constant within groups, I can't use this answer.

Any ideas how I might do this?

/edit: Based on joran's comments and flodel's answer, we're really reshaping the following data.table:

agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]

This reshape can be accomplished a number of ways (I've gotten some good answers so far), but what I'm really looking for is something that will scale well to a data.table with millions of rows and hundreds to thousands of categories.
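For concreteness, here's a toy version of the reshape I'm after (a sketch with made-up data; the real table is far larger):

library(data.table)
toy <- data.table(ID = c(1, 1, 2), Month = c(1, 1, 1),
                  Category = c(10, 20, 10), Qty = c(5, 3, 7))
# desired wide form: one column per value of Category, with NA (or 0)
# where an ID/Month/Category combination is missing:
#    ID Month Qty.10 Qty.20
# 1:  1     1      5      3
# 2:  2     1      7     NA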

Zach
    Do you mean to fill the body of the table with `Qty`? Summing any duplicate combinations? – joran Apr 04 '13 at 22:19
  • @joran: in my example there are duplicate combinations, but for the sake of argument, let's assume there are not. What I want is a distinct column for each value of the Category field, with either NA or 0 for missing combinations. – Zach Apr 04 '13 at 22:23
  • @joran I think the correct answer to your question is yes: I want Category to become a column, with Qty in each column, NAs or 0s for missing Categories, and duplicates should be summed (but it's fair to do the summing before we reshape). – Zach Apr 05 '13 at 14:30

4 Answers


data.table implements fast, data.table-specific methods for melt and dcast (written in C). It also adds extra features for melting and casting multiple columns. Please see the Efficient reshaping using data.tables vignette.

Note that we don't need to load the reshape2 package.

library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
  ID=sample(1:200000, n, replace=TRUE), 
  Month=sample(1:12, n, replace=TRUE),
  Category=sample(1:800, n, replace=TRUE), ## to stay within the 2 billion (2^31) limit
  Qty=runif(n),
  key=c('ID', 'Month')
)
dim(DT)

> system.time(ans <- dcast(DT, ID + Month ~ Category, fun=sum))
#   user  system elapsed
# 65.924  20.577  86.987
> dim(ans)
# [1] 2399401     802
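
A note on missing combinations: dcast's fill argument controls what goes into cells with no matching rows. With fun=sum, empty cells should default to sum(numeric(0)), i.e. 0; to get NA instead, something like this should work (an untested sketch on the same DT; ans_na is just an illustrative name):

ans_na <- dcast(DT, ID + Month ~ Category, fun=sum, fill=NA_real_)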
Arun

Like this?

agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]

reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
        timevar = "Category", direction = "wide")
flodel

There is no data.table-specific wide reshaping method.

Here is an approach that will work, but it is rather convoluted.

There is a feature request, #2619 Scoping for LHS in `:=`, that would help make this more straightforward.

Here is a simple example:

# a data.table
DD <- data.table(a = letters[4:6], b = rep(letters[1:2], c(4, 2)), cc = as.double(1:6))
# with not all categories represented
DDD <- DD[1:5]
# trying to make `a` columns containing `cc`, retaining `b` as a column
# the unique values of `a` (you may want to sort this...)
nn <- unique(DDD[, a])
# create the correct wide data.table
# with NA of the correct class in each created column
rows <- max(DDD[, .N, by = list(a, b)][, N])
DDw <- DDD[, setattr(replicate(length(nn), {
                       # safe way to get an NA of the correct class
                       z <- cc[1]
                       is.na(z) <- 1
                       # using the rows value calculated previously
                       # to ensure the correct size
                       rep(z, rows)
                     }, simplify = FALSE), 'names', nn),
           keyby = list(b)]
# set key for binary search
setkey(DDD, b, a)
# the possible values of the b column
ub <- unique(DDw[, b])
# nested loop doing things by reference, so should be quick
# (the feature request would make it possible to speed this
# up using binary search joins)
for (ii in ub) {
  for (jj in nn) {
    DDw[list(ii), {jj} := DDD[list(ii, jj)][['cc']]]
  }
}

DDw
#    b  d e  f
# 1: a  1 2  3
# 2: a  4 2  3
# 3: b NA 5 NA
# 4: b NA 5 NA
mnel

EDIT

I found this SO post, which includes a better way to insert the missing rows into a data.table. The function fun_DT has been adjusted accordingly. The code is cleaner now; I don't see any speed improvements though.

See my update at the other post. Arun's solution works as well, but you have to manually insert the missing combinations. Since you have more identifier columns here (ID, Month), I only came up with a dirty solution here (creating an ID2 first, then creating all ID2-Category combinations, then filling up the data.table, then doing the reshaping).

I'm pretty sure this isn't the best solution, but if this FR is built in, those steps might be done automatically.

The solutions are roughly the same speed-wise, although it would be interesting to see how that scales (my machine is too slow, so I don't want to increase n any further; the computer has crashed too often already ;-)).

library(data.table)
library(rbenchmark)

fun_reshape <- function(n) {

  DT <- data.table(
    ID=sample(1:100, n, replace=TRUE), 
    Month=sample(1:12, n, replace=TRUE),
    Category=sample(1:10, n, replace=TRUE),
    Qty=runif(n)*500,
    key=c('ID', 'Month')
  )
  agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
  reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
          timevar = "Category", direction = "wide")
}

#UPDATED!
fun_DT <- function(n) {

  DT <- data.table(
    ID=sample(1:100, n, replace=TRUE), 
    Month=sample(1:12, n, replace=TRUE),
    Category=sample(1:10, n, replace=TRUE),
    Qty=runif(n)*500,
    key=c('ID', 'Month')
  ) 

  agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
  agg[, ID2 := paste(ID, Month, sep="_")]

  setkey(agg, ID2, Category)
  agg <- agg[CJ(unique(ID2), unique(Category))]

  agg[, as.list(setattr(Qty, 'names', Category)), by=list(ID2)]

}

library(rbenchmark)

n <- 1e+07
benchmark(replications=10,
          fun_reshape(n),
          fun_DT(n))
            test replications elapsed relative user.self sys.self user.child sys.child
2      fun_DT(n)           10  45.868        1    43.154    2.524          0         0
1 fun_reshape(n)           10  45.874        1    42.783    2.896          0         0
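
As Arun points out in the comments below, dcast's data.table method (from the accepted answer) handles the missing combinations directly. A comparable benchmark function might look like this (fun_dcast is a hypothetical sketch, not part of the original timings):

fun_dcast <- function(n) {
  DT <- data.table(
    ID=sample(1:100, n, replace=TRUE), 
    Month=sample(1:12, n, replace=TRUE),
    Category=sample(1:10, n, replace=TRUE),
    Qty=runif(n)*500,
    key=c('ID', 'Month')
  )
  dcast(DT, ID + Month ~ Category, value.var = "Qty", fun.aggregate = sum)
}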
Christoph_J
  • I'll try these both with 200,000 IDs and 1,000 categories and let you know how it goes. I suspect `fun_DT` is going to blow up, but `fun_reshape` might work. – Zach Apr 05 '13 at 14:26
  • @Zach Let me know, that would be interesting. Why do you think though that `fun_DT` is going to blow up? I think those additional fields have to be created anyway, one way or the other, so I wouldn't expect that. Hope I got it right. Also, see my update. Code is cleaner now. – Christoph_J Apr 06 '13 at 01:22
  • 200,000 IDs * 12 Months * 1,000 Categories = a full data.frame of 2,400,000,000 rows, which is greater than the maximum number of rows in an R data.frame (2^31 - 1 = 2,147,483,647). – Zach Apr 06 '13 at 01:33
  • Yeah, but that doesn't really change when you put the categories in columns, does it? Basically, you just transform the information, you don't reduce it. The only advantage I see is that you save the identifying columns (i.e., you don't have a column "Category" with 2.4 billion rows which has a lot of redundant information). And maybe one runs into a hardcoded limit (for instance for data.frames as you wrote). – Christoph_J Apr 06 '13 at 03:39
  • I don't know, but since it inherits from them, I guess so. – Christoph_J Apr 09 '13 at 21:35
  • That was my thinking too. So I want to avoid hitting that limit before pivoting. – Zach Apr 09 '13 at 21:40
  • @Christoph_J, maybe it's worthwhile to edit this post with the `dcast.data.table` method? If you don't have the time, I can do it as well. – Arun Mar 13 '14 at 10:13