Split data.table in roughly equal parts alphabetically

Question

I am trying to split a relatively large data.table into roughly equal parts and write the names in each part to its own csv file. I've already implemented a solution based on the question "Split data.table into roughly equal parts":

library(data.table)
library(plyr)
dt <- data.table(Name = letters)
bins <- 5
dt[order(-rank(Name)), Split.ID := as.integer(runif(.N,0,bins))]

d_ply(dt[order(-rank(Name))], .(Split.ID),
      function(sdf) write.csv(x = sdf[, 1],
                              file = paste0("Output/test.", sdf$Split.ID[[1]], ".csv"),
                              quote = FALSE, row.names = FALSE, eol = ", "))

The problem with this solution is that the names within each csv file are sorted, but because Split.ID is assigned randomly, no file contains a contiguous run of the alphabet. The overall ordering is therefore not preserved across the csv files.

`dt[, g := .I %% 5]; split(dt, by="g")` seems to work. To confirm, check `dt[, .N, by=g]` for the counts. This doesn't split contiguous chunks of `dt`, but you could fix that fairly easily I guess. Also, per the answer in the link, you can do `dt[, write.csv(.SD, ...), by=g]` instead of `split`. Also, fyi, data.table's `fwrite` command can write csvs very efficiently if that's a concern. — Frank, Feb 09 '17 at 15:23
Thanks for that, seems to be a lot easier, but it's not yet completely the desired result. I would like to have five consecutive letters in one csv file. So that inside one csv file everything is in the desired order. — hannes101, Feb 09 '17 at 15:35
That's just a matter of math... this seems to work: `dt[, g := cut(.I, breaks=5, labels=1:5)]`. By the way, looks like you don't need to load plyr here...? — Frank, Feb 09 '17 at 15:56
Ah alright, thanks. I thought plyr was necessary for `d_ply` in my preliminary code. — hannes101, Feb 09 '17 at 16:04
Oh, nope, my mistake. I didn't notice the `d_ply`; hadn't run your code that far. — Frank, Feb 09 '17 at 16:06
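
Putting the suggestions from the comments together, a minimal sketch might look like the following. It assumes ascending alphabetical order is wanted, uses `cut` on the row index (as suggested above) to get contiguous groups, and writes each group with `fwrite`. The column name `g`, the temporary output directory `out_dir`, and the file name pattern are illustrative choices, not part of the original code.

```r
library(data.table)

dt <- data.table(Name = letters)
bins <- 5

# Sort first so that row position reflects alphabetical order
# (assumption: ascending order is what is wanted).
setorder(dt, Name)

# Cut the row index into `bins` contiguous intervals; labels = FALSE
# returns the integer interval codes 1..bins instead of a factor.
dt[, g := cut(.I, breaks = bins, labels = FALSE)]

# Check that the group sizes are roughly equal.
print(dt[, .N, by = g])

# Write one csv per group. `out_dir` is a hypothetical location;
# replace it with the real output folder.
out_dir <- file.path(tempdir(), "Output")
dir.create(out_dir, showWarnings = FALSE)
dt[, fwrite(.SD, file.path(out_dir, paste0("test.", .BY$g, ".csv"))), by = g]
```

Because `g` is the grouping column, `.SD` contains only `Name`, so each file holds one consecutive block of names. With 26 rows and 5 bins the groups cannot all be the same size, which is why checking `dt[, .N, by = g]` is worthwhile.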
