
This question is about creating "wide" tables similar to those you could create using dcast from reshape2. I know this has been discussed many times before, but my question concerns how to make the process more efficient. I have provided several examples below, which might make the question seem lengthy, but most of it is just test code for benchmarking.

Starting with a simple example,

> library(data.table)
> z <- data.table(col1=c(1,1,2,3,4), col2=c(10,10,20,20,30),
                  col3=c(5,2,2.3,2.4,100), col4=c("a","a","b","c","a"))

> z
     col1 col2  col3 col4
1:    1   10   5.0    a      # col1 = 1, col2 = 10
2:    1   10   2.0    a      # col1 = 1, col2 = 10
3:    2   20   2.3    b
4:    3   20   2.4    c
5:    4   30 100.0    a

We need to create a "wide" table that has the values of col4 as column names and, for each combination of col1 and col2, the value of sum(col3).

> ulist = unique(z$col4) # These will be the additional column names

# Create long table with sum
> z2 <- z[,list(sumcol=sum(col3)), by='col1,col2,col4']

# Pivot the long table
> z2 <- z2[, as.list(sumcol[match(ulist, col4)]), by=c("col1","col2")]

# Add column names
> setnames(z2, c("col1", "col2", ulist))

> z2
   col1 col2   a   b   c
1:    1   10   7  NA  NA  # a = 5.0 + 2.0 = 7 corresponding to col1=1, col2=10
2:    2   20  NA 2.3  NA
3:    3   20  NA  NA 2.4
4:    4   30 100  NA  NA
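In case the match() step above looks opaque, this is roughly what it does inside a single group, written out by hand for the col1=1, col2=10 group (the values below are copied from the output above, purely for illustration):

# Within one (col1, col2) group, match() lines each unique col4 value up
# with its aggregated sum; values with no match in the group become NA.
sumcol <- 7                 # aggregated sum(col3) present in this group
col4   <- "a"               # the col4 value it belongs to
ulist  <- c("a", "b", "c")
sumcol[match(ulist, col4)]
# [1]  7 NA NA   -> becomes the a, b, c columns of the wide row for col1=1, col2=10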

The issue I have is that while the above method is fine for smaller tables, it becomes virtually impossible to run on very large tables (unless you are fine with waiting several hours, maybe).

This, I believe, is likely related to the fact that the pivoted / wide table is much larger than the original table, since each row in the wide table has n columns corresponding to the unique values of the pivot column, regardless of whether any value corresponds to that cell (these are the NA values above). The size of the new table is therefore often 2x+ that of the original "long" table.
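As a tiny, hand-checkable illustration of that growth using the 5-row example above (reshape2's dcast is used here only to build the wide form; the names long and wide are mine):

library(data.table)
library(reshape2)

# Count stored cells in the long vs. wide form: every wide row carries one
# column per unique col4 value, whether or not that cell is populated.
long <- z[, list(sumcol = sum(col3)), by = 'col1,col2,col4']
wide <- dcast(long, col1 + col2 ~ col4, value.var = "sumcol")
prod(dim(long))  # 4 rows x 4 cols = 16 cells
prod(dim(wide))  # 4 rows x 5 cols = 20 cells, 8 of them NA

With only 3 unique pivot values the overhead is small, but with 20-25 unique values most of the wide table is NA slots that still have to be allocated.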

My original table has ~500 million rows and about 20 unique values in the pivot column. I have attempted to run the above using only 5 million rows, and it takes forever in R (too long to wait for it to complete).

For benchmarking purposes, the example (using 5 million rows) completes in about 1 minute on production RDBMS systems running multithreaded, and in about 8 seconds on a single core using KDB+/Q (http://www.kx.com). It might not be a fair comparison, but it gives a sense that these operations can be done much faster by alternative means. KDB+ doesn't have sparse rows, so it allocates memory for all the cells and is still much faster than anything else I have tried.

What I need, however, is an R solution :) and so far I haven't found an efficient way to perform similar operations.

If you have experience with any alternative / more optimal solution, I'd be interested to hear about it. Sample code is provided below. You can vary the value of n to simulate the results. The number of unique values for the pivot column (c3) has been fixed at 25.

library(data.table)

n = 100 # Increase this to benchmark

z <- data.table(c1=sample(1:10000, n, replace=TRUE),
    c2=sample(1:100000, n, replace=TRUE),
    c3=sample(1:25, n, replace=TRUE),
    price=runif(n)*10)

c3.unique <- 1:25

z <- z[, list(sumprice=sum(price)), by='c1,c2,c3'][, as.list(sumprice[match(c3.unique, c3)]), by='c1,c2']
setnames(z, c("c1", "c2", c3.unique))
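For timing individual runs, a minimal sketch along the following lines could be used (zl and zw are hypothetical names; the long table is rebuilt first because the code above overwrites z with the wide result):

library(data.table)

# Rebuild a fresh long table, then time only the aggregate + pivot step.
zl <- data.table(c1 = sample(1:10000, n, replace = TRUE),
                 c2 = sample(1:100000, n, replace = TRUE),
                 c3 = sample(1:25, n, replace = TRUE),
                 price = runif(n) * 10)

print(system.time({
  zw <- zl[, list(sumprice = sum(price)), by = 'c1,c2,c3'][
            , as.list(sumprice[match(c3.unique, c3)]), by = 'c1,c2']
  setnames(zw, c("c1", "c2", c3.unique))
}))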

Thanks,

Raj.
  • Just as a question, what is your purpose for making data that would produce that many NA's wide? It's the kind of data long format is perfect for. – John Sep 14 '13 at 14:05
  • In market research, it is sometimes necessary to create wide tables for use by quant/research groups. It's not necessarily a format everyone prefers, but there are definitely certain disciplines where these are required for end users. The cells can be blank (no NAs) if necessary, having the NAs doesn't add any information anyhow. – xbsd Sep 14 '13 at 14:11
  • Why does it have to be an R solution if you are able to get much better timings using other methods. If you're using this to generate tables for others to use, do they really care how you got the data into the required form? – A5C1D2H2I1M1N2O1R2T1 Sep 14 '13 at 14:45
  • Can you include your KDB code you're benchmarking? Btw, @Arun is already working on a fast dcast in C. – Matt Dowle Sep 14 '13 at 14:55
  • R is a good choice for the end users as it offers numerous statistical functions in addition to data manipulation, whereas a db system requires SQL and in case of KDB, the Q language (based on A programming language, too cryptic even for a regular programmer). This is just one use case, and it soon becomes cumbersome for IT to do every manipulation for all analysts. ... I don't need the performance to match "8 seconds", but a reasonable time frame in order of minutes would be good enough. – xbsd Sep 14 '13 at 14:57
  • Well, shortening it to one call cuts it roughly in half on my computer: `names(ulist) <- ulist; z[,lapply(ulist,function(x)sum(col3[col4==x])),by='col1,col2']` – Frank Sep 14 '13 at 15:03
  • Hi Matthew,all, Q Code here - [KDB Q Code for Pivot Example](http://pastebin.com/1TuBGY9e). The time in the code is 11865 milliseconds. For everyone's information - I am using the 32-bit trial version of KDB for this which is available [here](http://kx.com/software-download.php). There is nothing to install, just unzip in the q zip file home directory (under ~/) or under C:\ and run. You can also enable http port by typing \p 5000 on the q prompt and accessing it via kdbstudio or `http://yourhostname:5000?` on a browser. – xbsd Sep 14 '13 at 15:14
  • FYI - Timing in the code is 11865 milliseconds on my laptop. Just to add, in production, we would generally run the code multicore (using the Q peach function or similar) on 64-bit Q, so it is a lot faster. R is typically 3x+ of the time. So if the Q time for a large is ~ 1 hour, R would be ~ 5-6 hrs. If there would be a way to parallelize such an operation, it could be a more comparable comparison. – xbsd Sep 14 '13 at 15:43
  • @xbsd, yes as @MatthewDowle mentioned, I'm trying to implement `dcast` in C. I'll update the post with benchmarks once it's done. – Arun Sep 14 '13 at 18:12
  • @Arun, thanks. I see many benchmarks that are posted on SO, but few that have benchmarks against large datasets (so called "big data"). I regularly work with datasets that are several GBs in size. It'd be great if you could benchmark the new dcast() against some production big data grade datasets. Such benchmarks go a long way in convincing management to transition to R. Oftentimes, you can use KDB+ to benchmark and see how well it compares with the equivalent in R. Anything close to the performance of KDB+ will carry a lot of value in financial (and similar) fields. – xbsd Sep 15 '13 at 02:33
  • @xbsd, question: in your actual dataset (with 500 million rows and 5 million rows), how many unique values do you have in "c1" and "c2"? I guess "c3" is around 20... – Arun Sep 15 '13 at 06:43
  • @Arun, I'll check on this and get back. c3 is 25. c1 and c2 is a lot more. – xbsd Sep 15 '13 at 13:10
  • @xbsd, I'd be much interested in the data for which your benchmark resulted in 11.9 seconds (using KDB Q)... for benchmarking. – Arun Sep 15 '13 at 16:52
  • @Arun -- File: [t.csv.bz2](http://www.filedropper.com/tcsv) Code: [R and Q Code](http://pastebin.com/WELEjii7) R Time: 27.759 seconds, Q Time: 11.543 seconds. Should add -- these operations can be parallelized in Q using different methods. I have tried to do similar operations using R, generally with packages parallel, multicore, doMC, etc … and have also tried chunking data to sequentially process segments of the dataset in a map reduce fashion. These have almost "always" resulted in seg faults, unfortunately. The dataset sizes in these tests were ~ in the order of 35 GB+. – xbsd Sep 15 '13 at 18:04
  • @xbsd, Okay, I just benchmarked on the current C version of cast... Good news! Your data runs in 12 seconds on the C-version against your data.table code which runs in 40 seconds on my (slow) laptop. So, basically it seems like a 3.3x speedup (roughly). Hopefully, that'll turn out faster than the KDB approach. And hopefully, I'll be able to optimise it further... I'll try to wrap things up soon and commit and then post/update an answer. – Arun Sep 15 '13 at 18:36
  • @Arun ... wow, what can I say ! :) You're going to make a lot of people really happy (and the guys at Revolution R not as much ...) with that kind of performance. Let me know once you have it in svn. Would like to try it out against the original datasets. – xbsd Sep 15 '13 at 18:40

1 Answer


For n=1e6 the following takes about 10 seconds with plain dcast and about 4 seconds with dcast.data.table:

library(reshape2)

dcast(z[, sum(price), by = list(c1, c2, c3)], c1 + c2 ~ c3)

# or with 1.8.11
dcast.data.table(z, c1 + c2 ~ c3, fun = sum)
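
For reference, one way this comparison could be reproduced on the question's sample data might look like the sketch below (w1 and w2 are just illustrative names; dcast.data.table needs data.table 1.8.11 or later):

library(data.table)
library(reshape2)

n <- 1e6
z <- data.table(c1 = sample(1:10000, n, replace = TRUE),
                c2 = sample(1:100000, n, replace = TRUE),
                c3 = sample(1:25, n, replace = TRUE),
                price = runif(n) * 10)

# Aggregate first, then cast the summed column with reshape2::dcast
system.time(w1 <- dcast(z[, sum(price), by = list(c1, c2, c3)], c1 + c2 ~ c3))

# Aggregate and cast in one step with dcast.data.table (1.8.11+)
system.time(w2 <- dcast.data.table(z, c1 + c2 ~ c3, fun = sum))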
  • @Arun I don't understand your point 1 - why would it take any time to "convert" a `data.table` to a `data.frame`? – eddi Dec 04 '13 at 23:33
  • re point 2 - you're right, it's about 20% faster (so I edited that version instead) – eddi Dec 04 '13 at 23:36