
I have the following:

An environment that works like a hash for rows in a data frame. For example, the environment "inc" has the key "hello", and get("hello", envir = inc) returns "row1" "row2" "row50", which are row names in a data frame. After selecting the rows for a key in the environment, I want to perform colSums on them.

The environment has roughly 400,000 entries, and I want to create a new data frame with 400,000 rows based on these colSums. I have working code that uses lapply/foreach to do this, and I've run it on a small subset of the data... but it is INCREDIBLY slow. As in... it's been running for 20 minutes on 3 cores using doMC and it's still not done. Here is the code:

incCounts <- foreach(key = ls(inc)) %dopar% {
    transNames <- get(key, envir = inc)
    transCounts <- df[transNames, ]
    if (!is.null(dim(transCounts)))
        transCounts <- colSums(transCounts)
    return(transCounts)
}
incCounts <- as.data.frame(t(simplify2array(incCounts)))

EDIT: Here is an example of what I'm trying to do with a data.frame and data.table:

library(data.table)
set.seed(20)
transEnv <- new.env(hash = TRUE)
assign("hash1", paste("trans", 2:4, sep = ""), envir = transEnv)
assign("hash2", paste("trans", c(1, 3), sep = ""), envir = transEnv)

df <- data.frame(matrix(rnorm(5 * 4), nrow = 4, ncol = 5))
rownames(df) <- paste("trans", 1:4, sep = "")
colSums(df[transEnv$hash1, ]) # what I want
       X1         X2         X3         X4         X5 
0.9476963 -3.2149230  0.7603257 -1.8494967  1.7569055

dt <- data.table(trans = rownames(df), df)
setkey(dt, trans)

# This isn't working as I expected... 
dt[transEnv$hash1, list(sum(X1), sum(X2), sum(X3), sum(X4), sum(X5))]

      trans         V1         V2         V3        V4         V5
[1,] trans2 -0.1444402 -1.4720633 -0.6135086  1.108451 1.24556891
[2,] trans3  0.7222297 -0.5961595 -0.2163115 -1.097342 0.08785472
[3,] trans4  0.3699069 -1.1467001  1.5901458 -1.860606 0.42348190

Any help would be greatly appreciated! Thanks!

Harold
  • are you running out of memory? also, have you tried a different parallel backend, say mclapply, instead of foreach? – ALiX Jun 13 '12 at 00:10

1 Answer


This may be appropriate for data.table. See wiki point 5, and this answer. If you do try it, start with vignette('datatable-intro').

DT[,lapply(.SD,sum),by=grp]
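As a minimal sketch of that idiom (the table `DT` and grouping column `grp` here are invented for illustration):

```r
library(data.table)

# toy table: a grouping column plus two numeric columns
DT <- data.table(grp = c("a", "a", "b"),
                 x   = c(1, 2, 3),
                 y   = c(10, 20, 30))

# sum every column of .SD within each group
# group "a": x = 3, y = 30; group "b": x = 3, y = 30
DT[, lapply(.SD, sum), by = grp]
```

`.SD` is the subset of the data for each group, so `lapply(.SD, sum)` gives a per-group colSums without naming each column.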



To answer the edit, the relevant section of ?data.table is:

Advanced: Aggregation for a subset of known groups is particularly efficient when passing those groups in i. When i is a data.table, DT[i,j] evaluates j for each row of i. We call this by without by, or, grouping by i.

So instead of

dt[transEnv$hash1, list(sum(X1), sum(X2), sum(X3), sum(X4), sum(X5))]

try:

dt[transEnv$hash1, list(sum(X1),sum(X2),sum(X3),sum(X4),sum(X5)), mult="last"]

or,

dt[transEnv$hash1][, list(sum(X1), sum(X2), sum(X3), sum(X4), sum(X5))]

or,

dt[transEnv$hash1,lapply(.SD,sum),by="",.SDcols=names(dt)[-1]]

or,

dt[transEnv$hash1][,trans:=NULL][,sapply(.SD,sum)] 
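As a sanity check against the edit's example (this reuses set.seed(20), transEnv, and df exactly as constructed in the question, and uses lapply(.SD, sum) rather than spelling out each column):

```r
library(data.table)
set.seed(20)

# rebuild the question's example data
transEnv <- new.env(hash = TRUE)
assign("hash1", paste("trans", 2:4, sep = ""), envir = transEnv)
df <- data.frame(matrix(rnorm(5 * 4), nrow = 4, ncol = 5))
rownames(df) <- paste("trans", 1:4, sep = "")

dt <- data.table(trans = rownames(df), df)
setkey(dt, trans)

# join to the keys, then sum every column except the key column
res <- dt[transEnv$hash1][, lapply(.SD, sum),
                          .SDcols = setdiff(names(dt), "trans")]

# should be TRUE: matches the plain data.frame version
all.equal(unlist(res), colSums(df[transEnv$hash1, ]))
```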
Matt Dowle
  • Hi Matthew. Thank you for this. So, I've created a data.table and sorted it by key "transNames". Now, I need a colSum over all columns EXCEPT 'transNames'. I can't actually use "by" because by construction 'transNames' is unique and I have too many different combinations of rows to make it feasible to use "by". I simply want fast access to the data.frame rows. I've tried doing something like `DT[c("trans1", "trans2"), list(sum(V1), sum(V2))]` but this just returns the DT with rows "trans1" and "trans2". Suggestions? Thanks! – Harold Jun 18 '12 at 19:39
  • @Harold I don't follow I'm afraid. Can you provide a small example dataset to illustrate input and required output please. – Matt Dowle Jun 18 '12 at 20:20