Let's say there is a data set of strings that cannot all fit into memory together and we want to remove all duplicates.
I am not looking for code but hoping someone can walk me through this.
If I could fit the entire data set into memory, I would sort it, then iterate through and drop any element that is the same as the previous one.
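For reference, the in-memory version I have in mind looks roughly like this (just a sketch, assuming `strings` is a hypothetical list that fits in memory):

```python
def dedupe_in_memory(strings):
    """Sort, then keep each element only if it differs from the previous one."""
    strings.sort()
    result = []
    for s in strings:
        if not result or s != result[-1]:
            result.append(s)
    return result
```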
In the actual case, I was thinking of loading each workable "chunk" of the dataset into memory, sorting it, removing duplicates within that chunk, and repeating this for every chunk. This seems pretty inefficient, and it only removes the duplicates that span chunks if the entire remaining data set fits into memory for a final pass.
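A sketch of that chunk-by-chunk idea, assuming a hypothetical input file with one string per line and a chunk size that fits in memory; note how duplicates that land in different chunks survive this pass:

```python
import itertools

def dedupe_within_chunks(in_path, out_path, chunk_size=1_000_000):
    """Sort and dedupe each chunk independently; cross-chunk duplicates remain."""
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            chunk = list(itertools.islice(src, chunk_size))
            if not chunk:
                break
            chunk.sort()
            prev = None
            for line in chunk:
                if line != prev:
                    dst.write(line)
                    prev = line
```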
Suggestions?
Edit: The way I approached this earlier for a smaller problem was to maintain a hash table in memory, iterate through each chunk of the data set that fits into memory, add each string to the hash table if it isn't already there, and skip it otherwise. Can we do better?
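That hash-table approach, again just a sketch assuming one string per line in a hypothetical input file; the set still has to hold every distinct string, so it only works while the distinct strings fit in memory:

```python
def dedupe_with_set(in_path, out_path):
    """Stream the file, keeping a set of strings seen so far."""
    seen = set()
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line not in seen:
                seen.add(line)
                dst.write(line)
```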