
I want to convert a data.table object into a 0/1 contingency table like the one the table function produces: colb values as column names, cola values as row names, and an entry of 1 if the (cola, colb) pair occurs in the data, otherwise 0. I do it like this:

library(data.table)

dt <- data.table(cola = c(1, 1, 2, 3), colb = c(10, 20, 30, 40))
dt
table(dt)

> dt
   cola colb
1:    1   10
2:    1   20
3:    2   30
4:    3   40
> table(dt)
colb
cola 10 20 30 40
   1  1  1  0  0
   2  0  0  1  0
   3  0  0  0  1

But when the data set is big, for example 39 million rows by 2 columns in my case, the table operation takes ~80 seconds to finish.

I would like to know whether there is a more efficient method that does the same thing as the table function.

In addition, dcast.data.table(dt, cola ~ colb, fill = 0L) does roughly the same thing when I try it, but the results differ slightly and need further handling (shown below) to match the output of the table function. More importantly, dcast.data.table does not improve the speed on my data. So I hope someone can help work out a more efficient method to do the same thing!
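
For reference, the further handling I mean looks roughly like this (a sketch: fun.aggregate = length makes the entries counts rather than the raw colb values, and the cola column has to be moved into the row names to match table):

# Count occurrences of each (cola, colb) pair in wide form.
wide <- dcast.data.table(dt, cola ~ colb, fun.aggregate = length, fill = 0L)

# dcast() returns a data.table with cola as a regular column; convert to a
# matrix with cola as the row names so it matches the output of table().
m <- as.matrix(wide[, -1, with = FALSE])
rownames(m) <- wide$cola
m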

Thank you.

  • Surprising that `dcast()` and `table()` have the same performance... Never heard that before. In any case, performance issues are hard to answer unless we've access to the data you're working on (or at the least simulated data). Also, please post your `sessionInfo()` output. – Arun Feb 18 '15 at 20:21
  • I was also surprised that `table` operations seemed to be slower in data.table with 7 million rows than they were in dataframes. I'm working with an older Mac SL -build but current version, and data.table 1.9.4. I was doing the table operation as a j-argument inside "[". I had other more complicated functions that I found difficult to port over to data.table and just went back to my old dataframe ways. – IRTFM Feb 18 '15 at 20:35
  • @BondedDust what kind of complicated functions? Maybe post a question and we will try to figure it out. – David Arenburg Feb 18 '15 at 20:45
  • @Arun: > sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BiocInstaller_1.16.1 data.table_1.9.5 loaded via a namespace (and not attached): [1] chron_2.3-45 – BioChemoinformatics Feb 18 '15 at 21:23
  • @BioChemoinformatics, thanks. I've tested it with dummy data on your dimensions. `table()` seems quite efficient here (~50s on mine). The `CJ` step is killing us here (internal implementation also uses `CJ`). Good to know there are improvements possible. Will update when/if I find a way to improve this. – Arun Feb 18 '15 at 21:47
  • @Arun, thanks. Waiting for your improvement. – BioChemoinformatics Feb 18 '15 at 21:52
  • @BioChemoinformatics I'm curious - given how large and sparse the resulting wide table is - what exactly is the point of this operation? – eddi Feb 18 '15 at 21:58
  • @DavidArenburg The complications for me had to do with the fact that I was passing in column names as character values and then constructing multiple tables with `do.call` and `tapply` and "[" and then taking elementwise ratios of matrices. I did give thought to building a MRE and posting . Perhaps I will. – IRTFM Feb 18 '15 at 22:00
  • @BioChemoinformatics, improvements aside, I'm curious too. The wide data is too sparse (your original data is ~300Mb, and after casting it'd result in ~3Gb). – Arun Feb 18 '15 at 22:02
  • @Arun and eddi: Actually, what I want in the end is the same as this: Intersect all possible combinations of list elements (http://stackoverflow.com/questions/24614391/intersect-all-possible-combinations-of-list-elements), and then to use Ananda Mahto's method. – BioChemoinformatics Feb 18 '15 at 22:06
  • @eddi, the question I ask here is one of the steps in Ananda's method. – BioChemoinformatics Feb 18 '15 at 22:14
  • @BioChemoinformatics, I think [this answer](http://stackoverflow.com/a/27771045/559784) using Matrix package, which deals very efficiently with sparse matrices might be the way to go here.. But I'll add this to the list of possible places for enhancements. – Arun Feb 18 '15 at 22:18
  • For the future, explaining the actual task (briefly) would save everyone's time. – Arun Feb 18 '15 at 22:19
  • @Arun, thanks. I just used the 'slam' package to solve my problem; the question here is how to get such a sparse matrix. I had actually already solved my final question, but I want to improve the speed. – BioChemoinformatics Feb 18 '15 at 22:24
  • Good to know. As for your Q, using the function `sparseMatrix()` as shown in the answer linked? – Arun Feb 18 '15 at 22:39

1 Answer


First, thanks @Arun and all. Yes, sparseMatrix can solve my original question. Here I list the answer (following Arun's suggestion), using a small demo example of what I originally wanted:

library(data.table)
library(Matrix)

dt <- data.table(sid = c(1, 2, 3, 4, 3, 2, 1, 6, 1, 2),
                 aid = c(100, 100, 100, 100, 200, 200, 200, 300, 300, 300))
dt
# Sparse logical matrix: rows indexed by sid, columns indexed by aid.
sm <- sparseMatrix(dt[, sid], dt[, aid], x = TRUE)
# Entry (i, j) of the cross-product counts the sids shared by aids i and j.
cp <- t(sm) %*% sm
# Triplet (i, j, x) form of the non-zero entries.
cp <- summary(cp)
# Keep each pair once (strict upper triangle, excluding the diagonal).
cp <- cp[cp$i < cp$j, ]
as.data.frame(cp)
    i   j x
4 100 200 3
7 100 300 2
8 200 300 2

This method is more efficient than the one I used before (Ananda Mahto's method). The time comparison for my data set (39,763,098 rows and 2 columns): ~141 seconds for my original method vs ~40 seconds for Arun's method. So thanks, it is perfect. Second, I hope that data.table can eventually improve on this for the question in my post as well.
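
For the original question itself (the 0/1 table of cola vs colb), the same idea should work. A minimal sketch, assuming the labels are first mapped to contiguous 1..n indices via factor so they can serve as matrix positions:

library(data.table)
library(Matrix)

dt <- data.table(cola = c(1, 1, 2, 3), colb = c(10, 20, 30, 40))

# Map the labels to contiguous integer indices; keep the labels for dimnames.
fa <- factor(dt[, cola])
fb <- factor(dt[, colb])

# Sparse 0/1 equivalent of table(dt), without materialising the dense
# (and mostly zero) wide table.
sm <- sparseMatrix(i = as.integer(fa), j = as.integer(fb), x = 1L,
                   dimnames = list(levels(fa), levels(fb)))
sm

Keeping the result sparse avoids building the ~3 GB dense table mentioned in the comments.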
