
I have a data file with a large number of values (53,000,000+) and I would like to pull out a random subset of n of these values (say, 2,000,000). I implemented a Perl script that pulls the list into memory, uses the Fisher-Yates method to shuffle the array, and then prints out the first n values in the shuffled list. However, this shuffling process is taking a lot of time, even on much smaller test sets (50,000 values).

I'm looking for a more efficient, scalable way to identify a random subset of a huge set of values and print it out. Any suggestions?

Update: Based on the answers and some more searching, it looks like the correct terminology is "random sampling".

Daniel Standage
  • How are you swapping? Is the I/O the bottleneck or the shuffling? Perhaps some code would help too. – delnan Sep 20 '11 at 16:30
  • @delnan I tried both methods described on the linked thread. I/O is definitely not a problem. It loads into memory pretty quickly, but then spends *a lot* of time on the shuffling step. It never finishes and starts printing. Now that I've tried it, I don't think shuffling approaches are going to be efficient enough for this many values, so I'm probably more interested in alternative approaches. – Daniel Standage Sep 20 '11 at 16:35
  • How random do you need the data to be? You might be able to use a loop that jumps by random indices and then marks the element as 'used' after retrieving it to prevent duplicates. – Brian McFarland Sep 20 '11 at 16:45

4 Answers


Elaborating on aix's answer above, to choose k out of a stream of items, read the items one at a time. Keep the first k items in a set S.

Now when reading the m-th item I (m>k now), keep it with probability k/m. If you do keep it, select an item U uniformly at random from S, and replace U with I.

The proof that this yields all subsets of size k with equal probability is by induction on m. Note that you don't need to know n (the total number of items) in advance, and that S at every step is a uniform random sample of the items read so far. The algorithm is "streaming" - it doesn't require storing all items, or making a second pass.
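A minimal Perl sketch of this reservoir-sampling idea, reading one record per line from standard input (the sample size and variable names below are assumptions for illustration, not code from the answer):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $k = 2_000_000;    # desired sample size (assumed from the question)
    my @sample;           # the reservoir S

    while ( my $line = <STDIN> ) {
        chomp $line;
        my $m = $.;                          # number of items read so far
        if ( $m <= $k ) {
            push @sample, $line;             # keep the first k items
        }
        elsif ( rand($m) < $k ) {            # keep the m-th item with probability k/m
            $sample[ int rand $k ] = $line;  # replace a uniformly chosen element of S
        }
    }

    print "$_\n" for @sample;

Only the k sampled records are ever held in memory, so this scales to arbitrarily large inputs.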

adnan
  • 196
  • 7
  • this is an online algorithm, perfect for streams with unknown size. but size is known here, so you can do better than that – Karoly Horvath Sep 21 '11 at 00:09
  • you can do better in the sense that you need to make fewer calls to the random number generator (k instead of n) when n is known in advance, and you can store the set in memory. time complexity-wise, both are O(n), and the online algorithm uses O(1) space, rather than Θ(n) – adnan Sep 22 '11 at 14:06
  • nope, if the set is already in memory, the time complexity is only O(k). if not, then it's a memory vs speed tradeoff, so in that sense you're right. – Karoly Horvath Sep 22 '11 at 14:10
  • Perhaps it wasn't clear from the problem statement, but I don't know `n` beforehand--I was giving a rough idea to show the scale of the problem. And given the nature of the items, loading the whole set into memory is not feasible. So I like this answer a lot. – Daniel Standage Sep 23 '11 at 13:48

First, check your implementation of the shuffle. If implemented correctly, it should run in linear time. Also, modify the algorithm to stop after the desired number of elements have been shuffled: there's no need (practically and theoretically) to shuffle more numbers than you actually output.

If you ask for k numbers this will then cost you k elemental operations. I doubt you can do a lot better than that.
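A sketch of that truncated Fisher-Yates shuffle in Perl (the subroutine name is purely illustrative); it performs exactly k swaps and then returns the first k positions:

    # Truncated Fisher-Yates: shuffle only the first $k positions of the
    # array, then return those $k elements as the random sample.
    sub partial_shuffle_sample {
        my ( $values, $k ) = @_;               # $values is an array reference
        my $n = scalar @$values;
        for my $i ( 0 .. $k - 1 ) {
            my $j = $i + int rand( $n - $i );  # random index from the not-yet-fixed tail
            @$values[ $i, $j ] = @$values[ $j, $i ];   # swap
        }
        return @$values[ 0 .. $k - 1 ];
    }

This does k swaps regardless of n, although the whole array still has to fit in memory.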

Michael Nett

Don't shuffle; it's unnecessarily expensive.

There's a simple linear algorithm discussed in Jon Bentley's "Programming Pearls" (which Bentley says he learnt from Knuth's "Seminumerical Algorithms"). Use this method instead.

There are some Perl implementations about:

These two snippets implement Algorithm S (3.4.2) and Algorithm R (3.4.2) from Knuth's The Art of Computer Programming. The first randomly selects N items from an array of elements and returns a reference to an array containing the selected elements. Note that it will not necessarily consider all of the elements in the list.

The second randomly selects N items from a file of indeterminate size and returns an array containing the selected elements. Records in the file are assumed to be one per line, and the lines are chomped while reading. This requires only one pass through the list. A slight modification can be made to use the snippet in situations where N records would exceed memory limitations, however this requires slightly more than one pass.
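For illustration, here is a rough Perl sketch of the first idea (selection sampling, Knuth's Algorithm S) for the case where the data is already in an array of known size; the subroutine name is an assumption, not the snippets this answer refers to:

    # Knuth's Algorithm S (selection sampling): choose $n items from an
    # array of known size, scanning it at most once.
    sub selection_sample {
        my ( $values, $n ) = @_;
        my $remaining = scalar @$values;    # items not yet examined
        my $needed    = $n;                 # items still to be selected
        my @sample;
        for my $item (@$values) {
            if ( rand($remaining) < $needed ) {   # take with probability needed/remaining
                push @sample, $item;
                last if --$needed == 0;     # may stop before reaching the end
            }
            $remaining--;
        }
        return \@sample;
    }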

NPE

Reading and shuffling the array would involve a lot of unnecessary data movement.

Here are a few ideas:

One: When you say you need a random subset, what exactly do you mean by "random" in this context? By which I mean, are the records in any particular order, or is the order relevant to whatever it is you are trying to randomize?

Because my first thought is that if the records are not in any relevant order, then you can get a random selection by simply calculating the total size divided by the sample size, and then selecting every n-th record. So for example, if you have 53 million records and you want a sample of 2 million, take 53 million / 2 million ≈ 26, so read every 26th record.
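A tiny Perl sketch of that every-n-th-record idea, reading from standard input (the counts are the ones from the question):

    # Systematic sampling: keep every $step-th line of the input.
    my $step = int( 53_000_000 / 2_000_000 );     # comes out to 26
    while ( my $line = <STDIN> ) {
        print $line if ( $. - 1 ) % $step == 0;   # $. is the current line number
    }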

Two: If that's not adequate, a more rigorous solution would be to generate 2 million random numbers in the range of zero to 53 million, ensuring there are no duplicates.

Two-A: If your sample size were small compared to the total number of records, like if you were just picking out a few hundred or a few thousand, I'd generate an array of however many entries and, for each entry, compare it to all previous entries to check for duplicates. If it's a duplicate, loop around and try again until you find a unique value.

Two-B: Assuming your numbers are not just examples but the actual values, then your sample size is large compared to the total population. In that case, given the ample memory on modern computers, you should be able to do this efficiently by creating an array of 53 million booleans initialized to false, each, of course, representing one record. Then run through a loop 2 million times. For each iteration, generate a random number from 0 to 53 million. Check the corresponding boolean in the array: If it's false, set it to true. If it's true, generate another random number and try again.
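A short Perl sketch of this Two-B idea, with an array of flags standing in for the booleans (the subroutine name is just for illustration):

    # Two-B: pick $k distinct record numbers out of $total by retrying
    # whenever a random number lands on an already-used slot.
    sub random_index_set {
        my ( $total, $k ) = @_;
        my @used = (0) x $total;       # one flag per record, all initially false
        my @indices;
        while ( @indices < $k ) {
            my $r = int rand $total;
            next if $used[$r];         # duplicate: generate another number
            $used[$r] = 1;
            push @indices, $r;
        }
        return \@indices;
    }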

Three: Or wait, here's a better idea yet, given the relatively large percentage: Calculate the percentage of records you want to include. Then loop through a counter of all the records. For each, generate a random number from 0 to 1 and compare it to the desired percentage. If it's less, read that record and include it in the sample. If it's greater, skip the record.

If it's important to get the exact number of sample records, you can recalculate the percentage for each record. For example -- and to keep the example simple -- let's pretend you want 10 out of 100 records:

You'd start with 10 / 100 = .1. So we generate a random number; say it comes up .04. Since .04 < .1, we include record #1.

Now we recalculate the percentage. We want 9 more records out of the 99 remaining, which gives 9/99 ≈ .0909. Say our random number is .87. That's greater, so we skip record #2.

Recalculate again. We still need 9 records out of 98 remaining. So the magic number is 9/98, whatever that comes to. Etc.

Once we've got as many records as we want, the probability for future records will be zero, so we'll never go over. If we near the end and haven't picked up enough records, the probability will get very close to 100%. Like, if we still need 8 records and there are only 8 records left, the probability will be 8/8=100% so we'll be guaranteed to take the next record.
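A minimal Perl sketch of this self-correcting version, using the 10-out-of-100 example from above (it is essentially the same selection-sampling logic as Knuth's Algorithm S mentioned in another answer):

    # Recalculate the acceptance probability as needed/remaining at each record.
    my $remaining = 100;    # records left to examine
    my $needed    = 10;     # records still wanted
    for my $record ( 1 .. 100 ) {
        if ( rand() < $needed / $remaining ) {
            print "take record #$record\n";
            last if --$needed == 0;          # got everything we wanted
        }
        $remaining--;
    }

Once $needed equals $remaining the acceptance probability becomes 1, so the final records are taken automatically, exactly as described above.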

Jay