11

I love the data.table package in R, and I think it could help me perform sophisticated cross tabulation tasks, but haven't figured out how to use the package to do tasks similar to table.

Here's some replication survey data:

opinion <- c("gov", "market", "gov", "gov")
ID <- c("resp1", "resp2", "resp3", "resp4")
party <- c("GOP", "GOP", "democrat", "GOP")

df <- data.frame(ID, opinion, party)

In tables, counting the number of opinions by party is as simple as table(df$opinion, df$party).

I've managed to do something similar in data.table, but the result is clunky and it adds a separate column.

dt <- data.table(df)
dt[, .N, by="party"]

There's a number of grouping operations in data.table that could be great for fast and sophisticated crosstabs of survey data, but i haven't found any tutorials on how to it. Thanks for any help.

tom
  • 977
  • 3
  • 14
  • 30

1 Answers1

14

We can use dcast from data.table (See the Efficient reshaping using data.tables vignette on the project wiki or on the CRAN project page).

dcast(dt, opinion~party, value.var='ID', length)

Benchmarks

If we use a slightly bigger dataset and compare the speed using dcast from reshape2 and data.table

set.seed(24)
df <- data.frame(ID=1:1e6, opinion=sample(letters, 1e6, replace=TRUE),
  party= sample(1:9, 1e6, replace=TRUE))
system.time(dcast(df, opinion ~ party, value.var='ID', length))
#   user  system elapsed 
#  0.278   0.013   0.293 
system.time(dcast(setDT(df), opinion ~ party, value.var='ID', length))
#   user  system elapsed 
# 0.022   0.000   0.023 

system.time(setDT(df)[, .N, by = .(opinion, party)])
#  user  system elapsed 
# 0.018   0.001   0.018 

The third option is slightly better but it is in 'long' format. If the OP wants to have a 'wide' format, the data.table dcast can be used.

NOTE: I am using the the devel version i.e. v1.9.7, but the CRAN should be fast enough.

akrun
  • 874,273
  • 37
  • 540
  • 662
  • this works, but it basically nullifies the the data.table package itself. most of the benefits from data.table from from the subsetting and grouping operations, like "by." This essentially treats the dt as a data.frame. Am i missing something, or is the data.table package just not good for survey data? – tom Oct 04 '15 at 16:17
  • 6
    @tom No, it is not treating it as `data.frame`. The `dcast` is from the `data.table` package and not from the `reshape2` and is optimized for speed. – akrun Oct 04 '15 at 16:18
  • @tom Updated with benchmarks. – akrun Oct 04 '15 at 16:23
  • @akrun are you using 1.9.7? I'm wondering if you're getting the boost from the solution to [FR #1251](https://github.com/Rdatatable/data.table/issues/1251) (which, if I understand, isn't in current CRAN release, 1.9.6); if so this should be mentioned. I remember `dt[,table(x,y)]` being competitive with `dt[,.N,by=.(x,y)]` but that's no longer the case with the update. – MichaelChirico Oct 07 '15 at 17:56
  • @MichaelChirico Yes, I am using 1.9.7. But, the CRAN version should stilll be faster than reshape2 dcast. – akrun Oct 07 '15 at 18:00
  • @akrun I believe so, but `dt[,table(x,y)]` is another straightforward option (faster than `reshape2::dcast` on my machine but slower than `data.table::dcast`) – MichaelChirico Oct 07 '15 at 18:03
  • @MichaelChirico Would that be memory efficient? – akrun Oct 07 '15 at 18:04
  • That I'm not sure. I do use `table` when speed isn't priority 1 because it's so much more readable. But the `.N` approach is better suited to generating 3- or n-way tables and, as you demonstrated, much faster. – MichaelChirico Oct 07 '15 at 18:47