To parallelize a task, I need to split a big data.table into roughly equal parts, keeping together the groups defined by a column, id. Suppose:
N is the length of the data
k is the number of distinct values of id
M is the number of desired parts
The idea is that M << k << N, so splitting by id is no good.
library(data.table)
library(dplyr)
set.seed(1)
N <- 16  # in application N is very large
k <- 6   # in application k << N
dt <- data.table(id = sample(letters[1:k], N, replace = TRUE), value = runif(N)) %>%
  arrange(id)
t(dt$id)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
# [1,] "a" "b" "b" "b" "b" "c" "c" "c" "d" "d" "d" "e" "e" "f" "f" "f"
In this example, the desired split for M=3 is {{a,b}, {c,d}, {e,f}} (part sizes 5, 6, 5), and for M=4 it is {{a,b}, {c}, {d,e}, {f}} (part sizes 5, 3, 5, 3).
More generally, if id were numeric, the cutoff points should be
quantile(id, probs = seq(0, 1, length.out = M + 1), type = 1)
or some similar split into roughly equal parts.
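For concreteness, here is a naive sketch of what I have in mind, applying the quantile idea to row positions rather than to id itself (split_by_id, mid, and part are just names I made up). It keeps whole ids together and reproduces the splits above, but I doubt it is the best way:

split_by_id <- function(dt, M) {
  grp <- dt[, .N, keyby = id]               # per-id row counts, in id order
  grp[, mid := cumsum(N) - N / 2]           # midpoint position of each id block
  grp[, part := ceiling(M * mid / sum(N))]  # cut the midpoints into M intervals
  dt[grp, part := i.part, on = "id"]        # map each id's part back onto the rows
  split(dt, by = "part")                    # list of up to M data.tables
}

parts <- split_by_id(copy(dt), M = 3)    # copy() because `:=` modifies dt by reference
lapply(parts, function(p) unique(p$id))  # {a,b}, {c,d}, {e,f}, as desired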
What is an efficient way to do this?