
To parallelize a task, I need to split a big data.table into roughly equal parts, keeping together groups defined by a column, id. Suppose:

N is the length of the data

k is the number of distinct values of id

M is the number of desired parts

The idea is that M << k << N, so splitting by id is no good.

library(data.table)
library(dplyr)

set.seed(1)
N <- 16 # in application N is very large
k <- 6  # in application k << N
dt <- data.table(id = sample(letters[1:k], N, replace=T), value=runif(N)) %>%
      arrange(id)
t(dt$id)

#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
# [1,] "a"  "b"  "b"  "b"  "b"  "c"  "c"  "c"  "d"  "d"   "d"   "e"   "e"   "f"   "f"   "f"  

In this example, the desired split for M=3 is {{a,b}, {c,d}, {e,f}}, and for M=4 it is {{a,b}, {c}, {d,e}, {f}}.

More generally, if id were numeric, the cutoff points should be
quantile(id, probs=seq(0, 1, length.out = M+1), type=1) or some similar split into roughly equal parts.
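
For illustration, something like this is what I have in mind (just a sketch, treating id as numeric via a factor; M <- 3 and id_num are only for this example):

M <- 3
id_num <- as.integer(factor(dt$id))                                  # pretend id were numeric
cuts   <- quantile(id_num, probs = seq(0, 1, length.out = M + 1), type = 1)
split(dt, cut(id_num, breaks = unique(cuts), include.lowest = TRUE)) # rows sharing an id stay together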

What is an efficient way to do this?

dzeltzer

4 Answers


If the distribution of the ids is not pathologically skewed, the simplest approach would be something like this:

split(dt, as.numeric(as.factor(dt$id)) %% M)

It assigns each id to a bucket using factor-value mod number-of-buckets.

For most applications this is good enough to get a relatively balanced distribution of the data. You should be careful with input like time series, though; in such a case you can simply enforce a random order of levels when you create the factor. Choosing a prime number for M is a more robust approach, but most likely less practical.
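
For example, a minimal sketch of the randomised-levels variant (shuffled_levels and buckets are names I'm introducing here, not part of the one-liner above):

shuffled_levels <- sample(unique(dt$id))                         # random order of levels
buckets <- as.numeric(factor(dt$id, levels = shuffled_levels)) %% M
split(dt, buckets)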

zero323

Preliminary comment

I recommend reading what the main author of data.table has to say about parallelization with it.

I don't know how familiar you are with data.table, but you may have overlooked its by argument...? Quoting @eddi's comment from below...

Instead of literally splitting up the data - create a new "parallel.id" column, and then call

dt[, parallel_operation(.SD), by = parallel.id] 
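
As a rough sketch of that idea (parallel_operation is a placeholder, and assigning ids to buckets with the modulo rule below is just one possibility, borrowed from zero323's answer):

dt[, parallel.id := as.integer(factor(id)) %% M]    # assign each id to one of M buckets
# dt[, parallel_operation(.SD), by = parallel.id]   # then run the work once per bucket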

Answer, assuming you don't want to use by

Sort the IDs by size:

ids   <- names(sort(table(dt$id)))
n     <- length(ids)

Rearrange so that we alternate between big and small IDs, following Arun's interleaving trick:

alt_ids <- c(ids, rev(ids))[order(c(1:n, 1:n))][1:n]
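
To see what the interleaving does, here is a small illustration with a toy vector of my own (assumed already sorted from smallest to largest id):

ids_toy <- c("s1", "s2", "s3", "s4", "s5")    # pretend these are ids sorted by size, smallest first
n_toy   <- length(ids_toy)
c(ids_toy, rev(ids_toy))[order(c(1:n_toy, 1:n_toy))][1:n_toy]
# [1] "s1" "s5" "s2" "s4" "s3"                # alternates smallest, largest, second smallest, ...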

Split the ids in order, with roughly the same number of IDs in each group (like zero323's answer):

gs  <- split(alt_ids, ceiling(seq(n) / (n/M)))

res <- vector("list", M)
setkey(dt, id)
for (m in 1:M) res[[m]] <- dt[J(gs[[m]])] 
# if using a data.frame, replace the last two lines with
# for (m in 1:M) res[[m]] <- dt[id %in% gs[[m]],] 

Check that the sizes aren't too bad:

# using the OP's example data...

sapply(res, nrow)
# [1] 7 9              for M = 2
# [1] 5 5 6            for M = 3
# [1] 1 6 3 6          for M = 4
# [1] 1 4 2 3 6        for M = 5

Although I emphasized data.table at the top, this should work fine with a data.frame, too.

Frank
  • The concern is not the speed with which the data is split up, but rather the need to split to M cores. – dzeltzer Aug 20 '15 at 20:41
  • @selig I agree. The part I quoted suggested (to me) that you thought efficiency was an issue in the context of this question (which is just about splitting the data, as far as I can tell). – Frank Aug 20 '15 at 20:44
  • instead of literally splitting up the data - create a new "parallel.id" column, and then call `dt[, parallel_operation(.SD), by = parallel.id]` – eddi Aug 21 '15 at 18:22
  • @eddi Okay, I've copied your comment into the answer. – Frank Aug 21 '15 at 18:44

If k is big enough, you can use this idea to split the data into groups:

First, let's find the size of each id:

group_sizes <- dt[, .N, by = id]

Then create two lists of length M: one to track the total size of each group, and one to record which ids each group contains:

grps_vals <- list()
grps_vals[1 : M] <- c(0)

grps_nms <- list()
grps_nms[1 : M] <- c(0)

(I added the zero values here on purpose, so that lists of length M can be created.)

Then loop over the ids; on every iteration, add the next id (and its size) to the currently smallest group. This keeps the groups roughly equal:

for (i in 1:nrow(group_sizes)) {
  sums <- sapply(grps_vals, sum)                              # current total size of each group
  idx  <- which(sums == min(sums))[1]                         # index of the smallest group so far
  grps_vals[[idx]] <- c(grps_vals[[idx]], group_sizes$N[i])   # add this id's size to it
  grps_nms[[idx]]  <- c(grps_nms[[idx]], group_sizes$id[i])   # and record the id itself
}

Finally, delete the first (zero) element from the list of names :)

grps_nms <- lapply(grps_nms, function(x){x[-1]})

> grps_nms
[[1]]
[1] "a" "d" "f"

[[2]]
[1] "b"

[[3]]
[1] "c" "e"
Vadym B.
  • I think maybe you want to sort group sizes so that the largest are added first. If the largest were instead added last, the sizes could end up very lopsided. – Frank Aug 20 '15 at 23:17
  • Sorting is also a good idea! On the other hand, I find the smallest group on every iteration. Given that M << the number of ids, it is enough to do either the sorting or the detection of the smallest group :) – Vadym B. Aug 21 '15 at 05:45
  • You add to the smallest group, but what you are adding to it will not necessarily be small. Suppose M = 2 and the group sizes are (2,2,4). Because you loop directly over the group sizes, the partition sizes will sequentially be {2,0}, {2,2}, {6,2} as your loop progresses. However, if you sort the group sizes first, to (4,2,2), you will have {4,0}, {4,2}, {4,4}, which is better. I'm not sure if I'm making myself clear here... – Frank Aug 21 '15 at 07:00
  • Yeah, I got your idea. It would be better! Thanks – Vadym B. Aug 21 '15 at 07:17

Just an alternative approach using dplyr. Run the chained script step by step to visualise how the dataset changes. It is a simple process.

    library(data.table)
    library(dplyr)

    set.seed(1)
    N <- 16 # in application N is very large
    k <- 6  # in application k << N
    dt <- data.table(id = sample(letters[1:k], N, replace=T), value=runif(N)) %>%
      arrange(id)



dt %>% 
  select(id) %>%
  distinct() %>%                   # select distinct id values
  mutate(group = ntile(id,3)) %>%  # create grouping 
  inner_join(dt, by="id")          # join back initial information
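
If actual separate pieces are needed afterwards, the result of the chain can then be split on the new group column (dt_grouped is a name I'm adding; this step is not part of the original answer):

dt_grouped <- dt %>%
  select(id) %>%
  distinct() %>%
  mutate(group = ntile(id, 3)) %>%   # same chain as above, stored
  inner_join(dt, by = "id")

split(dt_grouped, dt_grouped$group)  # list with one piece per group of ids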

PS: I've learnt lots of useful stuff from the previous answers.

AntoniosK
  • This is substantially the same as zero's answer, I think; it effectively ignores `N` (when you use `distinct`). – Frank Aug 20 '15 at 23:18
  • I assumed (just by seeing the a,b,c,... example) that the IDs we need to group have some sort of order in them. If not, then when I get the distinct IDs I can shuffle them, in case the first 2-3 IDs have a huge amount of observations, and use the shuffled version to group. I can update my answer if needed. – AntoniosK Aug 20 '15 at 23:25
  • Also, I think that the objective is not to split the dataset by distributing IDs to the groups, but keeping the same IDs in one of the groups. As he mentions : "the desired split for M=3 is {{a,b}, {c,d}, {e,f}} and for M=4 is {{a,b}, {c}, {d,e}, {f}}". I see each ID in only one of the groups. Am I missing something? – AntoniosK Aug 20 '15 at 23:34
  • Yes, the point is to partition the IDs into groups; all of the answers do this and my comment didn't contradict it, so I don't really know what you're saying. I'm not telling you to update it; you can decide that yourself. Just pointing out that it's roughly the same operation as in zero's answer. – Frank Aug 20 '15 at 23:45
  • No, I didn't say you told me to do anything further. I just got confused for a moment. – AntoniosK Aug 20 '15 at 23:50