Say we have a list of strings and we can't load the entire list into memory, but we can load parts of the list from a file. What would be the best way to remove the duplicates?
- How big is the file? What's your memory limitation? What's the target machine and language? What *kind* of strings? How long are they? How many duplicates are expected? – 0xbe5077ed Jun 23 '14 at 15:29
- If you're allowed to use external tools, the GNU/Linux `sort` program can sort files larger than memory, and remove duplicates. If the file is already sorted, see the `uniq` program. – Jim Mischel Jun 24 '14 at 14:07
- @0xbe5077ed: one specific case of this broad question is: http://stackoverflow.com/questions/32535222/memory-constrained-external-sorting-of-strings-with-duplicates-combinedcounted – Peter Cordes Sep 14 '15 at 19:38
2 Answers
One approach would be to use an external sort to sort the file, and then remove the duplicates with a single iteration over the sorted list. This approach requires very little extra space and O(n log n) accesses to the disk.
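For illustration, here is a minimal Python sketch of that idea, assuming one string per line in the input file; the function name, chunk size, and temporary-file handling are my own choices for the example, not part of the answer:

```python
import heapq
import itertools
import os
import tempfile

def external_sort_dedup(input_path, output_path, chunk_size=100_000):
    # Pass 1: read chunks that fit in memory, sort each one,
    # and write it out as a sorted "run" file.
    run_files = []
    with open(input_path) as f:
        while True:
            chunk = [line.rstrip("\n") + "\n" for line in itertools.islice(f, chunk_size)]
            if not chunk:
                break
            chunk.sort()
            run = tempfile.NamedTemporaryFile("w+", delete=False)
            run.writelines(chunk)
            run.seek(0)
            run_files.append(run)

    # Pass 2: k-way merge of the sorted runs; duplicates are now adjacent,
    # so comparing each line with the previously written one removes them.
    with open(output_path, "w") as out:
        prev = None
        for line in heapq.merge(*run_files):
            if line != prev:
                out.write(line)
                prev = line

    for run in run_files:
        run.close()
        os.unlink(run.name)
```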
Another approach is based on hashing: compute a hashcode for each string, and load a sublist containing all strings whose hashcode falls in a specific range. It is guaranteed that if x is loaded and it has a duplicate, the dupe will be loaded into the same bucket as well. This requires O(n * #buckets) accesses to the disk, but might require more memory. You can invoke the procedure recursively (with different hash functions) if needed.
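A rough sketch of the hashing approach, again assuming one string per line, that the bucket count is chosen so each bucket fits in memory, and that the bucket file names are purely illustrative:

```python
import os

def hash_dedup(input_path, output_path, num_buckets=64):
    # Pass 1: scatter every string into a bucket file chosen by its hash.
    buckets = [open(f"bucket_{i}.tmp", "w") for i in range(num_buckets)]
    with open(input_path) as f:
        for line in f:
            s = line.rstrip("\n")
            buckets[hash(s) % num_buckets].write(s + "\n")
    for b in buckets:
        b.close()

    # Pass 2: any duplicate of a string hashes to the same bucket, so each
    # bucket can be deduplicated independently with an in-memory set.
    with open(output_path, "w") as out:
        for i in range(num_buckets):
            seen = set()
            with open(f"bucket_{i}.tmp") as b:
                for line in b:
                    if line not in seen:
                        seen.add(line)
                        out.write(line)
            os.remove(f"bucket_{i}.tmp")
```

If a bucket still doesn't fit in memory, this is where the answer's suggestion to recurse with a different hash function would apply.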

- On the external sort, you can remove the duplicates during the second (i.e. merge) pass of the algorithm. – Jim Mischel Jun 23 '14 at 16:45
- @JimMischel: Removing duplicates that you can find within a batch in pass 1 lets you make fewer batches, saving merge work. You can also sort bigger pass-1 batches by building a trie, or even a DAWG, as you read input, to compactly represent the set of strings you've seen so far (finding duplicates in the process). See my answer on http://stackoverflow.com/questions/32535222/memory-constrained-external-sorting-of-strings-with-duplicates-combinedcounted/32537772#32537772 – Peter Cordes Sep 14 '15 at 19:37
My solution would be to do a merge sort, which can be done with external memory. After sorting, finding duplicates is as easy as only ever comparing two adjacent elements.
Example:
0: cat 1: dog 2: bird 3: cat 4: elephant 5: cat
Merge sort
0: bird 1: cat 2: cat 3: cat 4: dog 5: elephant
Then simply compare 0 & 1 -> no duplicate, so move forward. Compare 1 & 2 -> duplicate, so remove 1 (this could be as simple as filling it with an empty string to skip over later). Compare 2 & 3 -> duplicate, so remove 2, and so on.

The reason for removing the earlier index of each duplicate pair (1 and then 2, rather than 2 and then 3) is that it allows for a more efficient comparison -- you don't have to worry about skipping indices that have already been removed.
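A small sketch of the compare-adjacent-elements pass, assuming the sorted data fits in memory for this step (the sort itself would be the external merge sort described above); the function name and sample data are just for illustration:

```python
def dedup_sorted(items):
    # After sorting, duplicates sit next to each other, so it is enough to
    # compare each element with the last one that was kept.
    result = []
    for item in items:
        if not result or item != result[-1]:
            result.append(item)
    return result

words = ["cat", "dog", "bird", "cat", "elephant", "cat"]
print(dedup_sorted(sorted(words)))  # ['bird', 'cat', 'dog', 'elephant']
```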
