Say we have a list of strings and we can't load the entire list into memory, but we can load parts of the list from a file. What would be the best way to remove the duplicates?
- How big is the file? What's your memory limitation? What's the target machine and language? What *kind* of strings? How long are they? How many duplicates are expected? – 0xbe5077ed Jun 23 '14 at 15:29
- If you're allowed to use external tools, the GNU/Linux `sort` program can sort files larger than memory, and remove duplicates. If the file is already sorted, see the `uniq` program. – Jim Mischel Jun 24 '14 at 14:07
- @0xbe5077ed: one specific case of this broad question is: http://stackoverflow.com/questions/32535222/memory-constrained-external-sorting-of-strings-with-duplicates-combinedcounted – Peter Cordes Sep 14 '15 at 19:38
2 Answers
One approach would be to use an external sort to sort the file, and then remove the duplicates with a single iteration over the sorted list. This approach requires very little extra space and O(n log n) accesses to the disk.
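For illustration, here is a minimal Python sketch of that idea, assuming one string per line in the input file; the function name, chunk size, and temporary-file handling are my own choices for the example, not part of the answer:

```python
import heapq
import itertools
import os
import tempfile

def external_sort_dedup(input_path, output_path, chunk_size=100_000):
    # Pass 1: read chunks that fit in memory, sort each one,
    # and write it out as a sorted "run" file.
    run_files = []
    with open(input_path) as f:
        while True:
            chunk = [line.rstrip("\n") + "\n" for line in itertools.islice(f, chunk_size)]
            if not chunk:
                break
            chunk.sort()
            run = tempfile.NamedTemporaryFile("w+", delete=False)
            run.writelines(chunk)
            run.seek(0)
            run_files.append(run)

    # Pass 2: k-way merge of the sorted runs; duplicates are now adjacent,
    # so comparing each line with the previously written one removes them.
    with open(output_path, "w") as out:
        prev = None
        for line in heapq.merge(*run_files):
            if line != prev:
                out.write(line)
                prev = line

    for run in run_files:
        run.close()
        os.unlink(run.name)
```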
Another approach is based on hashing: compute a hashcode for each string, and load a sublist containing all strings whose hashcode falls in a specific range. It is guaranteed that if x is loaded and it has a duplicate, the dupe will be loaded into the same bucket as well. This requires O(n * #buckets) accesses to the disk, but might require more memory. You can invoke the procedure recursively (with different hash functions) if needed.
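A rough sketch of the hashing approach, again assuming one string per line, that the bucket count is chosen so each bucket fits in memory, and that the bucket file names are purely illustrative:

```python
import os

def hash_dedup(input_path, output_path, num_buckets=64):
    # Pass 1: scatter every string into a bucket file chosen by its hash.
    buckets = [open(f"bucket_{i}.tmp", "w") for i in range(num_buckets)]
    with open(input_path) as f:
        for line in f:
            s = line.rstrip("\n")
            buckets[hash(s) % num_buckets].write(s + "\n")
    for b in buckets:
        b.close()

    # Pass 2: any duplicate of a string hashes to the same bucket, so each
    # bucket can be deduplicated independently with an in-memory set.
    with open(output_path, "w") as out:
        for i in range(num_buckets):
            seen = set()
            with open(f"bucket_{i}.tmp") as b:
                for line in b:
                    if line not in seen:
                        seen.add(line)
                        out.write(line)
            os.remove(f"bucket_{i}.tmp")
```

If a bucket still doesn't fit in memory, this is where the answer's suggestion to recurse with a different hash function would apply.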

- On the external sort, you can remove the duplicates during the second (i.e. merge) pass of the algorithm. – Jim Mischel Jun 23 '14 at 16:45
- @JimMischel: Removing duplicates that you can find within a batch in pass 1 lets you make fewer batches, saving merge work. You can also sort bigger pass-1 batches by building a trie, or even a DAWG, as you read input, to compactly represent the set of strings you've seen so far (finding duplicates in the process). See my answer on http://stackoverflow.com/questions/32535222/memory-constrained-external-sorting-of-strings-with-duplicates-combinedcounted/32537772#32537772 – Peter Cordes Sep 14 '15 at 19:37
My solution would be to do a merge sort, which can be done with external memory. After sorting, finding duplicates is as easy as only ever comparing two adjacent elements.
Example:
0: cat 1: dog 2: bird 3: cat 4: elephant 5: cat
Merge sort
0: bird 1: cat 2: cat 3: cat 4: dog 5: elephant
Then simply compare 0 & 1 -> no duplicate, so move forward. Compare 1 & 2 -> duplicate, so remove 1 (this could be as simple as filling it with an empty string to skip over later). Compare 2 & 3 -> duplicate, so remove 2, and so on.

The reason for removing the earlier index of each duplicate pair (1 and then 2, rather than 2 and then 3) is that it allows for a more efficient comparison -- you don't have to worry about skipping indices that have already been removed.
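A small sketch of the compare-adjacent-elements pass, assuming the sorted data fits in memory for this step (the sort itself would be the external merge sort described above); the function name and sample data are just for illustration:

```python
def dedup_sorted(items):
    # After sorting, duplicates sit next to each other, so it is enough to
    # compare each element with the last one that was kept.
    result = []
    for item in items:
        if not result or item != result[-1]:
            result.append(item)
    return result

words = ["cat", "dog", "bird", "cat", "elephant", "cat"]
print(dedup_sorted(sorted(words)))  # ['bird', 'cat', 'dog', 'elephant']
```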
