Following up on some earlier data.table parallelism discussion (1) (2) (3), I'm still trying to figure it out. What's wrong with this syntax?
library(data.table)
set.seed(1234)
dt <- data.table(id = factor(sample(1L:10000L, size = 1e6, replace = TRUE)),
                 val = rnorm(n = 1e6), key = "id")
foo <- function(l) sum(l)
dt2 <- dt[, foo(.SD), by= "id"]
library(parallel)
cl <- makeCluster(detectCores())
dt3 <- clusterApply(cl, x= parallel:::splitRows(dt, detectCores()),
fun=lapply, FUN= function(x,foo) {
x[, foo(data.table:::".SD"), by= "id"]
}, foo= foo)
stopCluster(cl)
# Note: library(parallel) is annoying in that you often have to qualify names this way ("::", ":::") so that they are visible on the workers
Error in checkForRemoteErrors(val) : 4 nodes produced errors; first error: incorrect number of dimensions
cl <- makeCluster(detectCores())
dt3 <- clusterApply(cl, x= parallel:::splitRows(dt, detectCores()),
fun=lapply, FUN= function(x,foo) {
x <- data.table::data.table(x)
x[, foo(data.table:::".SD"), by= "id"]
}, foo= foo)
stopCluster(cl)
Error in checkForRemoteErrors(val) : 4 nodes produced errors; first error: object 'id' not found
I've played around with the syntax quite a bit. These two attempts seem to be the closest I can get, and obviously something's still not right.
My real problem is similarly structured but has many more rows, and I'm using a machine with 24 cores / 48 logical processors. Watching my computer use roughly 4% of its computing power (by using only 1 core) is really annoying.
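For comparison, here is a minimal sketch of how the same grouped sum might be arranged so that each worker receives a whole data.table. As far as I can tell, `fun = lapply` in the attempts above makes each worker iterate over the columns of its chunk, so `x` is a plain vector rather than a table, which would be consistent with both error messages. The sketch makes several assumptions that are not part of the attempts above: a fixed worker count (`n_workers`), chunks built per id rather than with `parallel:::splitRows` (so no group is split across workers), `clusterEvalQ` to attach data.table on the workers, and `parLapply` in place of `clusterApply`.

```r
library(data.table)
library(parallel)

set.seed(1234)
dt <- data.table(id  = factor(sample(1L:10000L, size = 1e6, replace = TRUE)),
                 val = rnorm(n = 1e6), key = "id")
foo <- function(l) sum(l)

# Serial reference result, as in the question
dt2 <- dt[, foo(.SD), by = "id"]

# Assumption: a small fixed worker count; detectCores() could be used instead
n_workers <- 2L
cl <- makeCluster(n_workers)

# Attach data.table on every worker so `[` dispatches to the data.table method
invisible(clusterEvalQ(cl, library(data.table)))

# Assign each id (not each row) to a chunk, so no group is split across workers
chunk  <- as.integer(dt$id) %% n_workers + 1L
chunks <- lapply(seq_len(n_workers), function(k) dt[chunk == k])

# Each worker gets one whole data.table and runs the grouped call on it;
# foo is passed explicitly because workers do not see the master's globals
res <- parLapply(cl, chunks, function(x, foo) {
  x[, foo(.SD), by = "id"]
}, foo = foo)
stopCluster(cl)

# Combine the per-worker results and restore the ordering by id
dt3 <- rbindlist(res)
setkey(dt3, id)
```

Whether this is actually faster than the serial call is a separate question: every chunk has to be serialized and sent to its worker, and for a cheap function like `sum` that overhead may well outweigh the gain.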