5

I have a very large amount of strings (multiple terabytes) stored on disk that I need to sort alphabetically and store in another file as quickly as possible (preferably in C/C++) and using as little internal memory as possible. It is not an option to pre-index the strings beforehand, so I need to sort the strings whenever needed, in a close to real-time fashion.

What would be the best algorithm to use in my case? I would prefer a suggestion for a linear algorithm rather than just a link to an existing software library like Lucene.

Marco A.
  • 43,032
  • 26
  • 132
  • 246
  • std::vector<std::string> stringarray; std::sort(stringarray.begin(), stringarray.end()); is not OK? – 4pie0 Jun 21 '14 at 10:09
  • @bits_international No, it's not OK, as `std::sort` does all the sorting in internal memory - we're talking multiple terabytes of strings here and `std::sort` would quickly use all the internal memory resources. –  Jun 21 '14 at 10:10
  • ah, OK, memory efficient, I see – 4pie0 Jun 21 '14 at 10:12
  • For common sorting algorithms it boils down on the type of data you're dealing with. How big is one string? Are there data restrictions? Can you run some analysis to find some patterns out of them? – Marco A. Jun 21 '14 at 10:12
  • @MarcoA. The strings are typically less than 1 kB, but can be larger; they can hold very different amounts of data and are not partially pre-sorted in any fashion. However, I would make an educated guess that about 70% of the strings have fewer than 30 characters. The strings are largely a mix of any kind of text and can be thought of as random in their nature from the viewpoint of figuring out what kind of linear algorithm to use. –  Jun 21 '14 at 10:14
  • 5
    With regard to sorting huge external data you usually chunk it and use a sorting algorithm like mergesort (whose complexity is guaranteed to be O(n log n) and which always uses O(n) memory) or Smoothsort (if severely memory-bound). You might also want to take a look at the new research stuff "GPUTeraSort" for sorting external data with accelerators. – Marco A. Jun 21 '14 at 10:25
  • 4
    It is a drastically wrong approach; you must try to use as much memory as you can. You cannot fit such large files in memory, so you must partition the data into smaller chunks, big enough to be sorted efficiently in-memory, and write them back to temporary files. Then merge the files into the final sorted output. – Hans Passant Jun 21 '14 at 10:31
  • @HansPassant I'm OK with using the available internal memory resources - just as long as the suggested algorithm doesn't exceed those primary storage boundaries. Still, there may be multiple concurrent execution operations. –  Jun 21 '14 at 10:41
  • That depends on what other jobs the machine does. If you need to be a friendly citizen then don't go over, say, the amount of RAM divided by 2. You also want to avoid starving the file system cache. Clearly making this a config setting and experimenting is a wise thing to do. – Hans Passant Jun 21 '14 at 10:43
  • @HansPassant I just updated the text on that point and you just beat me to it :) Anyways, there may be multiple concurrent execution operations that warrant sorting. –  Jun 21 '14 at 10:45
  • Well, don't do that. Sending the disk reader head back-and-forth with multiple processes doing this is going to kill perf. The one thing you never want to do is waiting for it to finish. – Hans Passant Jun 21 '14 at 10:47
  • Can you already pre-sort the strings when entering them? In that case you would have, e.g., 10000 files with 100 MB each, and then you would be able to sort them in memory. A trivial example would be to have files by the first *n* letters of the string. That structure would enable you still to add strings easily but the reading would be much easier. – DrV Jun 21 '14 at 10:57
  • How often are the strings changed and how often do they need to be resorted? Building some type of index that gets updated on an insert might be less work in the long run. – brian beuning Jun 21 '14 at 13:13
  • Or even sorting whenever there is time. In that case you could have a big ordered file (or files, as a 1 TiB file may wreak havoc with the file system) and then rest of the data is unordered. Then most of the data can be fetched fast, and only the remaining smaller file needs to be sorted when queried. But this all depends on the actual write and read rates. And, yes, databases exist. – DrV Jun 21 '14 at 21:12

5 Answers

5

You usually sort huge external data by chunking it into smaller pieces, operating on them and eventually merging them back. When choosing the sorting algorithm you usually take a look at your requirements:

  • If you need a time-complexity guarantee and a stable sort you can go for a mergesort (O(n log n) guaranteed), although it requires an additional O(n) space.

  • If severely memory-bound you might want to try Smoothsort (constant memory, O(n log n) time)

Otherwise you might want to take a look at the research stuff in the gpgpu accelerators field like GPUTeraSort.

Google servers usually have this sort of problem.

Marco A.
  • 43,032
  • 26
  • 132
  • 246
  • Without an actual implementation that limits (disk) IO in clever ways, these algorithms alone don't give an answer about how to read and write efficiently to/from disk. In general: use as much RAM as you can, limit disk IO where possible. Check my answer. – benjist Jun 22 '14 at 18:22
  • The compromise between chunk sizes and the "use as much RAM as you can" is a delicate one and it depends on the hardware and resources you can use. There's no exact answer to OP's question unless you're the guy who's implementing this stuff and has the "bigger picture" of his system. – Marco A. Jun 22 '14 at 18:28
  • I agree that an exact answer can't be given. But I disagree that this answers (or hints at) the "read/write efficiently to/from disk" part of his question. Though your suggested algorithms could work in combination (SmoothSort on each chunk, then merge chunks, as I described). But only then will the IO part be answered - note also that your Google link is about sorting in memory. – benjist Jun 22 '14 at 19:07
3

Simply construct a digital tree (a trie). Memory use will be much smaller than the input data, because many words share a common prefix. While adding data to the tree, mark the last node of each word as an end of word (incrementing a counter there, so duplicates are handled). Once all words are added, do a DFS, visiting children in whatever order you want the sort in (e.g. a->z), and output the data to a file. The time complexity matches the memory the trie occupies; it is hard to state exactly because it depends on the strings (many short strings with shared prefixes do better), but it is at worst O(n*k), where n is the count of strings and k is the average length of a string - still much better than the raw input would suggest. Sorry for my English.

PS. To solve the memory-size problem you can split the file into smaller parts, sort each of them with this method, and if you then have e.g. 1000 sorted files, remember the current first word of each (like queues): repeatedly output the smallest of them and read the next word from that file. This merge runs in very little extra time.

terjanq
  • 301
  • 1
  • 3
  • 13
1

I suggest you use the Unix "sort" command, which can easily handle such files. See How could the UNIX sort command sort a very large file? .

Before disk drives even existed, people wrote programs to sort lists that were far too large to hold in main memory.

Such programs are known as external sorting algorithms.

My understanding is that the Unix "sort" command uses the merge sort algorithm. Perhaps the simplest version of the external sorting merge sort algorithm works like this (quoting from Wikipedia: merge sort):

Name four tape drives as A, B, C, D, with the original data on A:

  • Merge pairs of records from A; writing two-record sublists alternately to C and D.
  • Merge two-record sublists from C and D into four-record sublists; writing these alternately to A and B.
  • Merge four-record sublists from A and B into eight-record sublists; writing these alternately to C and D.
  • Repeat until you have one list containing all the data, sorted --- in log2(n) passes.

Practical implementations typically have many tweaks:

  • Almost every practical implementation takes advantage of available RAM by reading many items into RAM at once, using some in-RAM sorting algorithm, rather than reading only one item at a time.
  • Some implementations are able to sort lists even when some or every item in the list is too large to hold in the available RAM.
  • polyphase merge sort
  • As suggested by Kaslai, rather than only 4 intermediate files, it is usually quicker to use 26 or more intermediate files. However, as the external sorting article points out, if you divide up the data into too many intermediate files, the program spends a lot of time waiting for disk seeks; too many intermediate files make it run slower.
  • As Kaslai commented, using larger RAM buffers for each intermediate file can significantly decrease the sort time -- doubling the size of each buffer halves the number of seeks. Ideally each buffer should be sized so the seek time is a relatively small part of the total time to fill that buffer. Then the number of intermediate files should be picked so the total size of all those RAM buffers put together comes close to but does not exceed available RAM. (If you have very short seek times, as with a SSD, the optimal arrangement ends up with many small buffers and many intermediate files. If you have very long seek times, as with tape drives, the optimal arrangement ends up with a few large buffers and few intermediate files. Rotating disk drives are intermediate).
  • etc. -- See the Knuth book "The Art of Computer Programming, Vol. 3: Sorting and Searching" for details.
David Cary
  • 5,250
  • 6
  • 53
  • 66
  • The same article does also point out that using larger RAM buffers can significantly decrease the sort time due to needing fewer seeks. 1000 or so disk seeks per pass wouldn't really have any sort of impact on performance if you're working in 8GB chunks, which isn't too outlandish as far as server RAM goes these days. Even 1GB would likely not suffer. The amount of time spent sorting would far outweigh any seek time. – Kaslai Jun 23 '14 at 14:55
0

Use as much memory as you can and chunk your data. Read one chunk at a time into memory.

Step 1) Sort entries inside chunks

For each chunk:

  • Use introsort to sort your chunk. But to avoid copying your strings around and having to deal with variable-sized strings and memory allocations (at this point it will be interesting and relevant whether you actually have fixed- or max-size strings or not), preallocate a std::array or other fitting container with pointers to your strings that point to a memory region inside the current data chunk. => So your introsort swaps the pointers to your strings, instead of swapping the actual strings.

  • Loop over each entry in your sort-array and write the resulting (ordered) strings back to a corresponding sorted strings file for this chunk

Step 2) Merge all strings from sorted chunks into resulting sorted strings file

  • Allocate a "sliding" window memory region for all sorted strings files at once. To give an example: If you have 4 sorted strings files, allocate 4 * 256MB (or whatever fits, the larger the less (sequential) disk IO reads required).

  • Fill each window by reading the strings into it (so, read as many strings at once as your window can store).

  • Use a merge-sort style comparison across your chunks, with a comparator that reads from the windows (e.g. stringInsideChunkA = getStringFromWindow(1, pointerToCurrentWindow1String) - pointerToCurrentWindow1String is a reference that the function advances to the next string). Note that if the string pointer into your window goes beyond the window size (or the last record didn't fit into the window), read the next memory region of that chunk into the window.

  • Use mapped IO (or a buffered writer) and write the resulting strings into a giant final sorted strings file

I think this could be an IO efficient way. But I've never implemented such thing.

However, in regards to your file size and the "non-functional" requirements yet unknown to me, I suggest you also consider benchmarking a batch import using LevelDB [1]. It's actually very fast, minimizes disk IO, and even compresses your resulting strings file to about half the size without impact on speed.

[1] http://leveldb.googlecode.com/svn/trunk/doc/benchmark.html

benjist
  • 2,740
  • 3
  • 31
  • 58
-1

Here is a general algorithm that will be able to do what you want with just a few gigs of memory. You could get away with much less, but the more you have, the less disk overhead you have to deal with. This assumes that all of the strings are in a single file, but it could be applied to a multiple-file setup.

1: Create some files to store loosely sorted strings in. For terabytes of data, you'd probably want 676 of them. One for strings starting in "aa", one for "ab", and so on until you get to "zy" and "zz".

2: For each file you created, create a corresponding buffer in memory. A std::vector<std::string> perhaps.

3: Determine a buffer size that you want to work with. This should not exceed much beyond 1/2 of your available physical memory.

4: Load as many strings as you can into this buffer.

5: Truncate the file so that the strings in your buffer are no longer on disk. This step can be delayed for later or omitted entirely if you have the disk space to work with or the data is too sensitive to lose in the case of process failure. If truncating, make sure you load your strings from the end of the file, so that the truncation is almost a NOP.

6: Iterate over the strings and store them in their corresponding buffer.

7: Flush all of the buffers to their corresponding files. Clear all the buffers.

8: Go to step 4 and repeat until you have exhausted your source of strings.

9: Read each file to memory and sort it with whatever algorithm you fancy. On the off chance you end up with a file that is larger than your available physical memory, use a similar process from above to split it into smaller files.

10: Overwrite the unsorted file with this new sorted file, or append it to a monolithic file.

If you keep the individual files rather than a monolithic file, you can make insertions and deletions relatively quickly. You would only have to load in, insert, and sort the value into a single file that can be read entirely into memory. Now and then you might have to split a file into smaller files, however this merely amounts to looking around the middle of the file for a good place to split it and then just moving everything after that point to another file.

Good luck with your project.

Kaslai
  • 2,445
  • 17
  • 17