2

A typical algorithm to shuffle 2^32 numbers would be:

  1. Create an array of 2^32 numbers and fill it with the values 0 to 2^32-1
  2. Let n = number of items in the array = 2^32
  3. Randomly pick an index from 0 to n-1, remove that number from the array, and push it onto a stack
  4. Now n is decremented by 1 and the stack size is incremented by 1
  5. Go to step 3 until the array is empty; the final stack is the solution
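For small n, the steps above can be sketched directly (Python purely for illustration; at n = 2^32 both the memory footprint and the O(n) cost of each removal make this impractical):

```python
import random

def shuffle_by_extraction(n):
    """Shuffle 0..n-1 using the pick-and-remove algorithm above.
    Each pop(i) compacts the list in O(n), so the whole run is
    O(n^2) -- fine for a demo, hopeless for n = 2**32."""
    arr = list(range(n))
    stack = []
    while arr:
        i = random.randrange(len(arr))   # step 3: random index in 0..n-1
        stack.append(arr.pop(i))         # remove from array, push on stack
    return stack
```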

2^32 = 4,294,967,296 items

2^32 * 4 = 17,179,869,184 bytes, if we use 4-byte unsigned ints

Since I don't have that much memory on one machine, memory-mapping a file with mmap() might be a good candidate (probably the most straightforward approach).

Out of curiosity, I was wondering if I can use MapReduce to solve this problem? What would the Map and Reduce functions look like?

This idea crossed my mind, because although I don't have that much memory on one machine, I definitely have that much memory in all the boxes I have on the LAN. The distributed nature of the data in MapReduce might help.

Alternative, equivalent algorithms that fit MapReduce are welcome, although it may be difficult to come up with one that doesn't degrade the randomness of the above algorithm.

kfmfe04
  • 14,936
  • 14
  • 74
  • 140
  • What you describe is a very odd way of doing a shuffle. Look up "fisher-yates". – Nick Johnson Nov 14 '11 at 04:21
  • Also, out of curiosity, why do you need to shuffle the numbers from 0 to 2^32-1? – Nick Johnson Nov 14 '11 at 04:21
  • Actually, it's essentially the same as Fisher-Yates as described in Wikipedia. The "Modern Version" by Durstenfeld with the swap is a bit nicer, though - but still has the same issue. – kfmfe04 Nov 14 '11 at 08:00
  • Presumably the missing step in your instructions is to replace the selected element with the last one in the array - otherwise, compacting the array is an O(n) operation. I'm still curious why you need to do this, though. – Nick Johnson Nov 14 '11 at 08:35
  • I need an unique (non-repeating) 32-bit key for indexing purposes – kfmfe04 Nov 14 '11 at 08:50
  • Hmm... ...I don't think memmove() in C for a contiguous block of memory depends on n (compacting an array). If it were a linked list, that's a different matter. – kfmfe04 Nov 14 '11 at 09:04

4 Answers

5

The paper "MapReduce: Simplified Data Processing on Large Clusters" describes (Page 3, just before section 3) how to use MapReduce to do a distributed sort. One way to do a random shuffle of 2^32 numbers is to give each number a randomly generated 80-bit key, and then sort the number+key by this key. With 80-bit keys, there will be very few ties (expected number about 2^-17), and you can use a final pass to put them into a random order.

No doubt there are better ways to do this if you are prepared to do a lot of relatively low level programming, but both random shuffle and sorting need to do a lot of serious data movement between machines, and it is likely that a lot of work will have been put in to make sorting clever about this - this way you get to reuse it.
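The key-then-sort idea can be sketched in memory for small inputs (illustrative only; in the paper's setting the sort itself is what gets distributed across machines):

```python
import random

def shuffle_by_random_keys(items, key_bits=80):
    """Shuffle by tagging each item with a random key and sorting on it.
    With 80-bit keys and n items, the expected number of key collisions
    is roughly n^2 / 2^81, so ties are rare; a final pass breaks them
    in random order."""
    keyed = sorted((random.getrandbits(key_bits), x) for x in items)
    out = []
    i = 0
    while i < len(keyed):
        # collect the (almost always size-1) group of equal keys
        j = i
        while j < len(keyed) and keyed[j][0] == keyed[i][0]:
            j += 1
        group = [x for _, x in keyed[i:j]]
        random.shuffle(group)  # break ties randomly
        out.extend(group)
        i = j
    return out
```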

mcdowella
  • 19,301
  • 2
  • 19
  • 25
  • Very nice insight! I will dig into this approach a bit more. – kfmfe04 Nov 14 '11 at 08:03
  • 1
    http://sortbenchmark.org/YahooHadoop.pdf has more details on how the sorting has to be done. The code is part of the Hadoop API - http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/examples/terasort/package-frame.html – Praveen Sripati Nov 14 '11 at 12:51
2

If you just need to be able to sample elements from a large random permutation, you don't have to realize it by creating and shuffling the whole thing. Check out this blog post for an example of how to generate a 'secure' (cryptographically intractable to guess) permutation from a block cipher.
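A toy illustration of the idea, using a small balanced Feistel network as the block cipher. The key, round count, and use of SHA-256 as the round function here are my own illustrative choices, not the construction from the linked post; a Feistel network is bijective on its domain by construction, so this maps [0, 2^32) onto itself with no repeats:

```python
import hashlib

def feistel_permute_32(x, key=b"demo-key", rounds=4):
    """Toy bijection on [0, 2**32): split x into two 16-bit halves and
    run a balanced Feistel network with a SHA-256-based round function."""
    left, right = x >> 16, x & 0xFFFF
    for r in range(rounds):
        # Keyed round function: 16 bits derived from (key, round, right half)
        digest = hashlib.sha256(key + bytes([r]) + right.to_bytes(2, "big")).digest()
        f = int.from_bytes(digest[:2], "big")
        left, right = right, left ^ f
    return (left << 16) | right
```

Because the mapping is a permutation, `feistel_permute_32(0)`, `feistel_permute_32(1)`, ... enumerates all 2^32 values exactly once, in an order that is hard to predict without the key, and uses O(1) memory.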

Nick Johnson
  • 100,655
  • 16
  • 128
  • 198
  • ty - the blog is an interesting start - inside the comments, there is a link to a useful paper that describes the cipher process in detail (http://www.cs.ucdavis.edu/~rogaway/papers/subset.pdf). – kfmfe04 Nov 15 '11 at 01:00
1

Your map step could apply the Fisher-Yates algorithm to subarrays of your input.

The reducing step would then have to combine shuffled subarrays through a random merge (taking the remaining size of the parts into account at each step).

However, I do not think that this offers any advantage over simply doing a Fisher-Yates shuffle on disk on a single machine, since all it does is replace the bottleneck of random disk access with the bottleneck of network speed.
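The size-weighted merge described above might look like this (a sketch: at each step, draw from a list with probability proportional to how many of its elements remain, which preserves uniformity when merging two uniformly shuffled parts):

```python
import random

def random_merge(a, b):
    """Merge two shuffled lists into one, picking the next element from
    'a' with probability len(a_remaining) / (len(a_remaining) + len(b_remaining))."""
    out = []
    i = j = 0
    while i < len(a) or j < len(b):
        na, nb = len(a) - i, len(b) - j
        if random.randrange(na + nb) < na:   # weighted choice by remaining sizes
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out
```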

Svante
  • 50,694
  • 11
  • 78
  • 122
  • As mentioned in the OP, the big bottleneck is in the space covered (insufficient memory on one machine), not in runtime speed. Doing F-Y on one machine would require 16GB of RAM on one machine, which I don't have. – kfmfe04 Nov 14 '11 at 08:05
  • BTW, doing F-Y on subarrays will create biases in the shuffle. – kfmfe04 Nov 14 '11 at 08:12
0

I need an unique (non-repeating) 32-bit key for indexing purposes

Why don't you maintain a counter in the application and increment it?

If it's a distributed application then you could use ZooKeeper. There is a similar SO thread.

ZooKeeper runs in Java and has bindings for both Java and C.
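For the single-process case the counter is trivial (a sketch; in the distributed case the answer's suggestion is to delegate this to ZooKeeper instead):

```python
import threading

class Counter32:
    """Thread-safe source of non-repeating 32-bit keys within one process."""
    def __init__(self):
        self._lock = threading.Lock()
        self._next = 0

    def next_key(self):
        with self._lock:
            if self._next >= 2**32:
                raise OverflowError("32-bit key space exhausted")
            k = self._next
            self._next += 1
            return k
```

Note this hands out keys in sequence; it guarantees uniqueness, not the random ordering the question asks about.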

Praveen Sripati
  • 32,799
  • 16
  • 80
  • 117