Given a hard drive with 120 GB, 100 GB of which are filled with strings of length 256, and 2 GB of RAM, how do I sort those strings in Java most efficiently? How long will it take?
-
You would almost definitely need an *in-place* sorting algorithm. – stakx - no longer contributing Apr 02 '10 at 11:50
-
How are the strings delimited? As in: is it one sequence with null characters between them, or are they a bunch of buffers of some set length, filled with characters? My basic question is: how easy is it to find and move the strings around? – Travis Gockel Apr 02 '10 at 11:52
-
This was a Google interview question. I know, because I got the question when I interviewed there. – Paul Tomblin Apr 02 '10 at 11:53
-
It is just as important to have the right tool for the job as the right strategy. A 160 GB drive costs £28, assuming there is no space available somewhere on some server or PC; in a company like Google, I would find that hard to believe. Having the right hardware would simplify and speed up the task enormously, saving the company money and getting the task finished earlier. Disk space is reusable; your time isn't. – Peter Lawrey Apr 02 '10 at 17:38
-
@Peter: sure, but at Google they probably want to know that you can code against the limits of available hardware, *as well as* having the ability to call for more hardware when needed. I very much doubt that if you said, "why can't I just have a machine with 128GB RAM, I know you have some", they'd say "yes, correct answer, next question". They might say, "yes, good answer, now would you like to assume you can't have one, or would you like us to increase the numbers and start again?" ;-) – Steve Jessop Apr 04 '10 at 01:23
-
If this is a Google interview question, then they're definitely testing how witty a person can get, so there are a lot of possible smart answers. Telling them to buy hardware after a second of thinking would cost you the job. – arscariosus Sep 04 '10 at 03:15
7 Answers
A1: You probably want to implement some form of merge sort.
A2: Longer than it would take if you had 256 GB of RAM in your machine.
Edit: stung by criticism, I quote from Wikipedia's article on merge sort:
Merge sort is so inherently sequential that it is practical to run it using slow tape drives as input and output devices. It requires very little memory, and the memory required does not depend on the number of data elements.
For the same reason it is also useful for sorting data on disk that is too large to fit entirely into primary memory. On tape drives that can run both backwards and forwards, merge passes can be run in both directions, avoiding rewind time.
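To make the quote concrete, here is a minimal sketch of the constant-memory merge step it describes: two already-sorted files of fixed-length 256-byte records are merged into a third, holding only one record from each input in RAM at a time. The file names and the unsigned byte-wise ordering are assumptions for illustration, not part of this answer.

```java
import java.io.*;

public class TwoWayMerge {
    static final int REC = 256;

    public static void main(String[] args) throws IOException {
        try (DataInputStream a = new DataInputStream(
                 new BufferedInputStream(new FileInputStream("runA.bin")));
             DataInputStream b = new DataInputStream(
                 new BufferedInputStream(new FileInputStream("runB.bin")));
             OutputStream out = new BufferedOutputStream(new FileOutputStream("merged.bin"))) {

            // Keep only one record from each input in memory at a time.
            byte[] ra = readRecord(a), rb = readRecord(b);
            while (ra != null && rb != null) {
                if (compare(ra, rb) <= 0) { out.write(ra); ra = readRecord(a); }
                else                      { out.write(rb); rb = readRecord(b); }
            }
            drain(ra, a, out);   // at most one of these still has records left
            drain(rb, b, out);
        }
    }

    // Copy the remaining records of one input to the output.
    static void drain(byte[] first, DataInputStream in, OutputStream out) throws IOException {
        for (byte[] r = first; r != null; r = readRecord(in)) out.write(r);
    }

    // Next fixed-length record, or null at end of file.
    static byte[] readRecord(DataInputStream in) throws IOException {
        byte[] buf = new byte[REC];
        try { in.readFully(buf); } catch (EOFException eof) { return null; }
        return buf;
    }

    // Unsigned byte-wise ordering of two fixed-length records.
    static int compare(byte[] x, byte[] y) {
        for (int i = 0; i < REC; i++) {
            int c = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (c != 0) return c;
        }
        return 0;
    }
}
```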

-
Merge sort does not necessarily sort in-place, which would make it impossible to do here. – Travis Gockel Apr 02 '10 at 11:54
-
It requires almost no memory to combine two individually sorted lists into a single sorted list. Merge sort uses this by splitting the list in half, sorting one half, sorting the other half, and then merging. Except that it does this recursively, so the first half of the data gets split in half, which gets split in half, which gets split in half... so that it's always working on a trivial problem. – Daniel Von Fange Apr 02 '10 at 12:26
-
We should clarify what memory we are talking about. High Performance Mark is certainly correct in that mergesort can operate with O(1) RAM. However, what about disk space? With only 20% more memory than your dataset, during the final merge, you cannot hold both an input and the output list fully on disk, because that would already require 150 GB. How do you intend to implement merging to free that memory early? – meriton Apr 05 '10 at 22:23
Here is how I'd do it:
Phase 1 is to split the 100 GB into 50 partitions of 2 GB, read each of the 50 partitions into memory, sort using quicksort, and write out. You want the sorted partitions at the top end of the disc.
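A rough sketch of phase 1, under a few assumptions not in the answer: the data is a single file of fixed 256-byte records, runs are capped at roughly 1 GB so the records plus array and object overhead fit inside a 2 GB heap, and `Arrays.sort` (a TimSort) stands in for the quicksort mentioned above. File names are illustrative.

```java
import java.io.*;
import java.util.Arrays;

public class RunGenerator {
    static final int REC = 256;
    static final int RECORDS_PER_RUN = 4_000_000;  // ~1 GB of records per run

    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream("input.bin")))) {
            byte[][] records = new byte[RECORDS_PER_RUN][];
            int run = 0;
            while (true) {
                // Fill the in-memory buffer with as many records as fit.
                int n = 0;
                while (n < RECORDS_PER_RUN) {
                    byte[] rec = new byte[REC];
                    try { in.readFully(rec); } catch (EOFException eof) { break; }
                    records[n++] = rec;
                }
                if (n == 0) break;
                // Sort this chunk entirely in memory.
                Arrays.sort(records, 0, n, RunGenerator::compare);
                // Write the sorted run out as its own file.
                try (OutputStream out = new BufferedOutputStream(
                        new FileOutputStream("run-" + (run++) + ".bin"))) {
                    for (int i = 0; i < n; i++) out.write(records[i]);
                }
                if (n < RECORDS_PER_RUN) break;  // hit end of input
            }
        }
    }

    // Unsigned byte-wise ordering of two fixed-length records.
    static int compare(byte[] x, byte[] y) {
        for (int i = 0; i < REC; i++) {
            int c = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (c != 0) return c;
        }
        return 0;
    }
}
```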
Phase 2 is then to merge the 50 sorted partitions. This is the tricky bit because you don't have enough space on the disc to store the partitions AND the final sorted output. So (see the merge sketch after these steps):
1. Do a 50-way merge to fill the first 20 GB at the bottom end of the disc.
2. Slide the remaining data in the 50 partitions to the top to make another 20 GB of free space contiguous with the end of the first 20 GB.
3. Repeat steps 1 and 2 until completed.
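A minimal sketch of the 50-way merge in step 1, using a priority queue keyed on each partition's current record. The disc-space juggling of step 2 (or the truncation variant in the edit below) is left out, readers are not closed for brevity, and the file names are illustrative assumptions.

```java
import java.io.*;
import java.util.PriorityQueue;

public class KWayMerge {
    static final int REC = 256;

    // One sorted partition, with its current (smallest unconsumed) record.
    static class Source {
        final DataInputStream in;
        byte[] current;
        Source(DataInputStream in) { this.in = in; }
        boolean advance() throws IOException {
            byte[] buf = new byte[REC];
            try { in.readFully(buf); } catch (EOFException eof) { current = null; return false; }
            current = buf;
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        int k = 50;
        PriorityQueue<Source> heap = new PriorityQueue<>(
                (a, b) -> compare(a.current, b.current));
        for (int i = 0; i < k; i++) {
            Source s = new Source(new DataInputStream(
                    new BufferedInputStream(new FileInputStream("run-" + i + ".bin"))));
            if (s.advance()) heap.add(s);
        }
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream("sorted.bin"))) {
            while (!heap.isEmpty()) {
                Source s = heap.poll();        // partition holding the smallest record
                out.write(s.current);
                if (s.advance()) heap.add(s);  // refill from that partition, if any left
            }
        }
    }

    // Unsigned byte-wise ordering of two fixed-length records.
    static int compare(byte[] x, byte[] y) {
        for (int i = 0; i < REC; i++) {
            int c = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (c != 0) return c;
        }
        return 0;
    }
}
```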
This does a lot of disc IO, but you can make use of your 2 GB of memory for buffering in the copying and merging steps to get good throughput, by minimizing the number of disc seeks and doing large data transfers.
EDIT - @meriton has proposed a clever way to reduce copying. Instead of sliding, he suggests that the partitions be sorted into reverse order and read backwards in the merge phase. This would allow the algorithm to release disc space used by partitions (phase 2, step 2) by simply truncating the partition files.
The potential downsides of this are increased disk fragmentation, and loss of performance due to reading the partitions backwards. (On the latter point, reading a file backwards on Linux / UNIX requires more syscalls, and the FS implementation may not be able to do "read-ahead" in the reverse direction.)
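For what it's worth, the truncation itself is trivial in Java. This fragment (class, method, and variable names are illustrative, not from the answer) keeps only the not-yet-merged head of a reverse-sorted partition and hands the already-consumed tail back to the file system:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class PartitionTruncate {
    // Keep only the first 'bytesNotYetMerged' bytes of the partition file;
    // the tail, already consumed by the backwards merge, is released to the FS.
    static void releaseMergedTail(String partitionFile, long bytesNotYetMerged) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(partitionFile, "rw")) {
            raf.setLength(bytesNotYetMerged);
        }
    }
}
```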
Finally, I'd like to point out that any theoretical predictions of the time taken by this algorithm (and others) are largely guesswork. The behaviour of these algorithms on a real JVM + real OS + real discs is just too complex for "back of the envelope" calculations to give reliable answers. A proper treatment would require actual implementation, tuning and benchmarking.

-
Time estimate based on how much data is written (assuming that the computation can be done in parallel and hence is free): 100GB (first phase) + 100GB (final output) + 80GB (slide 1) + 60GB (slide 2) + 40GB (slide 3) + 20GB (slide 4) = 400GB written. About four hours, assuming a conservative 30MB/s sustained write. Faster on decent hardware, but what decent hardware only has 2GB of RAM? ;-) – Steve Jessop Apr 03 '10 at 13:12
-
... but add some time for the fact that reading/sorting/writing in phase 1 can't be parallel. Also a possible quibble over what "given 2GB of RAM" means. You've assumed the availability of 2GB contiguous address space all backed by RAM not swap, which I think is fair enough given that it's a hypothetical question. But if the *machine* has 2GB RAM and 32-bit addressing, your chunks in the first phase will have to be smaller, resulting in a more-than 50 way sort later. Eventually, a too-many-way merge will get slow. – Steve Jessop Apr 03 '10 at 13:15
-
I think that an N-way merge can be done with logN comparisons per record written. – Stephen C Apr 03 '10 at 14:10
-
Sounds about right, since then you can sort N inputs using an N way merge in O(N log N) comparisons, as expected. These things have a habit of ending up zero-sum. If the block size doesn't matter much, then I'd guess that in practice you can speed up the first phase, by having a block reading, a block writing, and a block sorting, simultaneously. Worth an experiment, at least, and if you're writing to the same place you're reading from (not where that data came from originally) that may or may not be ideal. I've never timed the effects of disk seeks. – Steve Jessop Apr 04 '10 at 01:16
-
Average seek time for typical HDDs is 8 to 10 ms, according to http://www.pcguide.com/ref/hdd/perf/perf/spec/posSeek-c.html – Stephen C Apr 04 '10 at 01:36
-
Interesting, but this isn't an average case, I'd want a feel for how much it can vary and so on. If you're writing to the same place you're reading, you're doing small backward seeks, which I can imagine under the right circumstances might be much faster than arbitrary seeks. Plus we can choose exactly where on the disk we're reading and writing, by choosing what order to process the blocks. We might have multiple independent heads on the disk (I can dream). So loads of room for hardware-specific micro-optimizations, none of which we can sensibly estimate without an experiment ;-) – Steve Jessop Apr 04 '10 at 02:02
-
"this isn't an average case" ... that is correct. I was working on the assumption that if you do really large reads/writes using out 2Gb of memory as a buffer, the contribution of the seeks becomes negligible. But I think that we've already passed the point where the problem/solution is too complicated for theoretical analysis. – Stephen C Apr 04 '10 at 02:24
-
Also, I just remembered that the question says "in Java", so any attempts to use any of this would be at arm's length. I was just wondering how close one could get to the "ideal" performance. – Steve Jessop Apr 04 '10 at 23:01
-
If fragmentation of the result file is acceptable, your algorithm could be simplified - and speeded up - by reusing the file system's free block management. For instance, you could sort in reverse order in phase 1, do your n-way merge backwards in phase 2, and simply truncate all input files in step 2. This would reduce disk I/O to 200 GB read and 200GB written (as if sequential, for all practical purposes), which is only a factor of 2 above the theoretical minimum of 100 GB read and written. – meriton Apr 05 '10 at 22:38
I am basically repeating Krystian's answer, but elaborating:
Yes you need to do this more-or-less in place, since you have little RAM available. But naive in-place sorts would be a disaster here just due to the cost of moving strings around.
Rather than actually move strings around, just keep track of which strings should swap with which others and actually move them, once, at the end, to their final spot. That is, if you had 1000 strings, make an array of 1000 ints. array[i] is the location where string i should end up. If array[17] == 133 at the end, it means string 17 should end up in the spot for string 133. array[i] == i for all i to start. Swapping strings, then, is just a matter of swapping two ints.
Then, any in-place algorithm like quicksort works pretty well.
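A rough sketch of that idea, with the stated assumptions that the strings live as fixed 256-byte records in a single file (here called `strings.bin`, an illustrative name) and that each comparison reads both records straight off the disk. As the comments below point out, those per-comparison seeks turn out to be the real cost.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class IndexSort {
    static final int REC = 256;
    final RandomAccessFile data;

    IndexSort(RandomAccessFile data) { this.data = data; }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile data = new RandomAccessFile("strings.bin", "r")) {
            int count = (int) (data.length() / REC);   // ~419M records for 100 GB
            int[] order = new int[count];              // ~1.7 GB of ints
            for (int i = 0; i < count; i++) order[i] = i;
            new IndexSort(data).quicksort(order, 0, count - 1);
            // order[] now gives the sort order of the strings; the strings
            // themselves are moved once, at the end, in a final placement pass.
        }
    }

    // In-place quicksort of the index array; comparisons go to disk.
    void quicksort(int[] a, int lo, int hi) throws IOException {
        if (lo >= hi) return;
        byte[] pivot = read(a[lo + (hi - lo) / 2]);
        int i = lo, j = hi;
        while (i <= j) {
            while (compare(read(a[i]), pivot) < 0) i++;
            while (compare(read(a[j]), pivot) > 0) j--;
            if (i <= j) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; j--; }
        }
        quicksort(a, lo, j);
        quicksort(a, i, hi);
    }

    // Fetch the 256-byte record at the given index (one seek per call).
    byte[] read(int index) throws IOException {
        byte[] buf = new byte[REC];
        data.seek((long) index * REC);
        data.readFully(buf);
        return buf;
    }

    // Unsigned byte-wise ordering of two fixed-length records.
    static int compare(byte[] x, byte[] y) {
        for (int i = 0; i < REC; i++) {
            int c = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (c != 0) return c;
        }
        return 0;
    }
}
```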
The running time is surely dominated by the final move of the strings. Assuming each one moves, you're moving around about 100GB of data in reasonably-sized writes. I might assume the drive / controller / OS can move about 100MB/sec for you. So, 1000 seconds or so? 20 minutes?
But does it fit in memory? You have 100GB of strings, each of which is 256 bytes. How many strings? 100 * 2^30 / 2^8, or about 419M strings. You need 419M ints, each is 4 bytes, or about 1.7GB. Voila, fits in your 2GB.
-
Good point, but I would be a little bit worried about seek times. This method sounds like it requires a lot of seeks, so sustained throughput of 100MB/sec may not be the best measure. We have to move around 100*2^30/2^8 ~ 100*2^22 strings. If we are not careful, we might need say one seek per 100 writes. If each seek is 4ms ~ 2^-8 sec, it would take something like 2^14 sec ~ 4.5 h. – Kris Apr 02 '10 at 12:57
-
I'm obviously a bit slow today -- how do you populate the index array? I can see that once you have built the index array it is easy and quick to sort in memory, but I don't understand how you set about building it in the first place. – High Performance Mark Apr 02 '10 at 13:25
-
@Kristian - I think that estimate of 1 seek per 100 records written is highly optimistic ... – Stephen C Apr 02 '10 at 14:28
-
@High - the array just starts out initialized to {0,1,2,...}. I agree with comments that seeks are a factor. Also I didn't factor in time to read strings for comparison, which is not at all trivial. – Sean Owen Apr 02 '10 at 20:20
-
7"The running time is surely dominated ..." I challenge you to prove this. In particular, I am interested in how quicksort compares the strings, seeing you don't have enough RAM to store them all. (You are not proposing to read from the disk for every comparison, are you? If you are, you might wish to read up about seek times.) – meriton Apr 05 '10 at 22:46
-
Yeah I take that back. Off the top of my head I supposed that almost all comparisons need only seek and read a couple bytes to complete, but that's still going to involve a slow seek. Quicksort is going to do n lg n comparisons and I suppose we might expect it does < 2n comparisons. Revise the rough estimate upwards to 3000 seconds or so? Very rough. – Sean Owen Apr 06 '10 at 08:55
-
Very rough? More like a wild-ass guess. Assuming you do one disk seek for each comparison done by quicksort, where quicksort selects pivots optimally and each disk seek takes 0.01 seconds, the time spent seeking is 419000000 * log(419000000) * 0.01 ~= 4 years. Granted, you will have some cache hits so it would not be quite as bad. Still, this solution is at least two orders of magnitude slower than the approach described by Stephen C. – meriton Apr 06 '10 at 22:29
-
Sorry if this answer got under your bonnet, this is just StackOverflow fun-time discussion. I like your thinking, those seeks really must dominate even with optimistic assumptions. Go petition the OP to change the accepted answer so we can rest easy! – Sean Owen Apr 06 '10 at 23:14
-
@SeanOwen: I'm wondering how you would swap two strings in this case? You cannot read all the data into memory. So what would you do? – Nawaz Apr 11 '13 at 03:06
Sounds like a task that calls for an external sorting method. Volume 3 of "The Art of Computer Programming" contains a section with an extensive discussion of external sorting methods.

-
@Krystian, do you know of an external sort that doesn't require 2n space? – Marcelo Cantos Apr 02 '10 at 12:07
I think you should use BogoSort. You might have to modify the algorithm a bit to allow for in-place sorting, but that shouldn't be too hard. :)

You should use a trie (aka a prefix tree) to build a tree-like structure that allows you to easily walk through your strings in an ordered manner by comparing their prefixes. In fact, you don't need to store it in memory. You can build the trie as a tree of directories on your file system (obviously, not the one which the data is coming from).
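A very rough sketch of the directories-as-trie idea, with several assumptions layered on top of this answer: prefixes are hex-encoded so they are always valid directory names, only the first two bytes are used (giving 65,536 leaf buckets of roughly 1.5 MB each, small enough to finish sorting in RAM), and the root path, class, and method names are all illustrative. Walking the directory names in order then visits the buckets in sorted order.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.*;

public class DirectoryTrie {
    static final Path ROOT = Paths.get("/spare-disk/trie"); // not the drive the data lives on
    static final int LEVELS = 2;  // one directory level per leading byte: 65,536 leaf buckets

    // Route one 256-byte record into the bucket file for its prefix.
    // (Opening a stream per record is glacially slow; a real pass would batch writes.)
    static void insert(byte[] record) throws IOException {
        Path dir = ROOT;
        for (int level = 0; level < LEVELS; level++) {
            dir = dir.resolve(String.format("%02x", record[level] & 0xFF));
        }
        Files.createDirectories(dir);
        try (OutputStream out = Files.newOutputStream(dir.resolve("bucket"),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            out.write(record);
        }
    }
}
```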

AFAIK, merge-sort requires as much free space as you have data. This may be a requirement for any external sort that avoids random access, though I'm not sure of this.
