
I have a list of 100M+ elements (currently sorted) that I want to randomize (shuffle) AND chunk/split into smaller lists (about 50K buckets). What's the best approach to do this in terms of maximizing speed?

I'm open to any libraries or languages (currently using Node and Python for the project) if they have fast pre-built methods. Thanks!

P.S. This isn't just a theoretical exercise; I'm trying to figure this out for my internship, since we'll be running another script in parallel on about 50K Digital Ocean nodes, each of which takes one of the smaller lists as input.

kashkar
    Have you developed a method yourself, tested it, and found it too slow for your purposes? – TigerhawkT3 Jul 11 '15 at 21:31
  • It would be faster to just shuffle a list of integer element indices than the list itself. The same might be true for bucketing it into smaller lists (i.e. create a list of index ranges). Both could be done using built-ins. – martineau Jul 11 '15 at 21:50
    We don't suggest the best approach here. You post your approach and we try to make it better. – Swastik Padhi Jul 11 '15 at 22:04

1 Answer

Do this in C or C++ for max speed.

Use the "modern" Fisher-Yates shuffle on your array of records. Use a fast random, perhaps one found on stack overflow.

Then return the addresses of elements in the array at bucket-size offsets, where bucket size = total elements / number of buckets. E.g., with a bucket size of 50,000: &array[0], &array[50000], &array[100000], and so on.
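
A sketch of that bucketing step under the same assumptions (Record is a hypothetical element type): each bucket is just a pointer plus a length into the shuffled array, so nothing is copied.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Record { /* your element type */ };

    // A bucket is a view into the shuffled array, not a copy.
    struct Bucket {
        const Record* data;
        std::size_t   len;
    };

    std::vector<Bucket> make_buckets(const std::vector<Record>& a,
                                     std::size_t bucket_size) {
        std::vector<Bucket> buckets;
        for (std::size_t off = 0; off < a.size(); off += bucket_size) {
            buckets.push_back({&a[off], std::min(bucket_size, a.size() - off)});
        }
        return buckets;
    }

Each bucket can then be serialized or handed to a worker independently.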

fluffybunny