1

Let's say there is a data set of strings that cannot all fit into memory together and we want to remove all duplicates.

I am not looking for code but hoping someone can walk me through this.

If I could fit the entire data set into memory, I would sort the set, then iterate through and remove elements (if the current element is the same as the previous element).
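
For illustration, the in-memory pass I have in mind looks roughly like this (a Python sketch, not actual code from my project):

    def dedupe_in_memory(strings):
        # Sort so that duplicates become adjacent, then keep an element
        # only when it differs from the previously kept one.
        strings.sort()
        result = []
        for s in strings:
            if not result or s != result[-1]:
                result.append(s)
        return result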

In this actual case, I was thinking of loading each workable "chunk" of the dataset into memory, sorting it, removing dupes, and then doing this iteratively over each chunk. This seems pretty inefficient, and it only works if I can get the entire data set to fit into memory for a final pass that removes the remaining duplicates.

Suggestions?

Edit: The way I approached this earlier for a smaller problem was to maintain a hash table in memory, iterate through each chunk of the data set that fits into memory, add each string to the hash table if it isn't already there, and otherwise skip it. Can we do better?

Beebunny
  • 4,190
  • 4
  • 24
  • 37
  • Probably not performant, but: get the first string, search the rest of the dataset (in chunks or one by one) and remove dupes, move to the next, rinse, repeat. This is too broad of course, and how to do it depends on where you actually load the data from, and what the performance bottlenecks would be (would it be loading? transmitting? sorting?). Parallelization may help depending on the dataset origin – Jcl May 02 '16 at 06:54
  • This is the brute force solution which I want to avoid. – Beebunny May 02 '16 at 06:57
  • Well, more details are needed about the data origin then: is your dataset sorted? can the origin sort it using indexes (like a database)? "A dataset" is just too broad... if your dataset is a stream of random strings that must be read sequentially or using a cursor, there's no way to remove dupes other than bruteforcing (at least if the amount of non-equal strings will not fit in memory either)... if it's indexed or sorted, then other approaches are possible. – Jcl May 02 '16 at 07:00
  • Not sorted. Aside from that, the "source" shouldn't really matter. For simplicity's sake, you can assume it's being read from a flat file. – Beebunny May 02 '16 at 07:02
  • Again, if it's read from a flat file (or any other sequential cursor), isn't sorted, and the non-equal strings don't fit in memory either, then I personally don't think there's any better way than what you are using in your edit, but I'm definitely curious about what others may answer – Jcl May 02 '16 at 07:04
  • http://stackoverflow.com/questions/1645566/efficient-out-of-core-sorting?rq=1 – Beebunny May 02 '16 at 07:20

3 Answers

3

What I was looking for is called External Sorting.

https://en.wikipedia.org/wiki/External_sorting

Also, my question is a duplicate of this: Efficient Out-Of-Core Sorting
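
Roughly, the idea looks like this (a Python sketch with placeholder file paths and chunk size, not production code): sort each chunk in memory, write it out as a sorted run on disk, then merge the runs and drop adjacent duplicates while writing the output.

    import heapq
    import os
    import tempfile

    def external_dedupe(src_path, dst_path, chunk_bytes=100_000_000):
        # Phase 1: read the input in chunks that fit in memory, sort each
        # chunk (deduping within the chunk), and write it as a sorted "run".
        runs = []
        with open(src_path) as src:
            while True:
                lines = src.readlines(chunk_bytes)      # ~chunk_bytes per run
                if not lines:
                    break
                chunk = sorted(set(line.rstrip("\n") for line in lines))
                run = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
                run.write("\n".join(chunk) + "\n")
                run.close()
                runs.append(run.name)

        # Phase 2: k-way merge of the sorted runs. After merging, duplicates
        # are adjacent, so remembering the previously written string is enough.
        files = [open(name) for name in runs]
        streams = [(line.rstrip("\n") for line in f) for f in files]
        previous = None
        with open(dst_path, "w") as dst:
            for s in heapq.merge(*streams):
                if s != previous:
                    dst.write(s + "\n")
                    previous = s

        for f in files:
            f.close()
        for name in runs:
            os.remove(name)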

Community
  • 1
  • 1
Beebunny
  • 4,190
  • 4
  • 24
  • 37
0

If the number of distinct Strings in the List is not too big, you could try this:

Assumption:
I assume the number of different Strings in the list is so small that these Strings could fit into memory.

Solution:
You could iterate over the file, keep a Set of all Strings read so far, and skip every String that is already in the Set (because it is a duplicate).
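
A minimal sketch of that idea (Python; the file names are placeholders), under the assumption above that the set of distinct strings fits in memory:

    seen = set()
    with open("strings.txt") as src, open("unique.txt", "w") as dst:
        for line in src:                  # streams the file; never loads it all
            s = line.rstrip("\n")
            if s not in seen:             # first occurrence -> keep it
                seen.add(s)
                dst.write(s + "\n")
        # 'seen' grows with the number of distinct strings, which is exactly
        # the assumption: only the deduplicated set must fit in memory.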

MrSmith42
  • 9,961
  • 6
  • 38
  • 49
  • 1
    If the data set can fit into memory, yes - check if a string exists in the set (hash table, rather). If not, add it to the map. If it already exists, skip it. However, that's NOT the question. The question was about working with large sets that cannot fit into memory. – Beebunny May 02 '16 at 18:36
  • @Beebunny: As far as I understand the question, all entries (including the duplicates) do not fit in memory. That does not mean that a set of all entries WITHOUT THE DUPLICATES would also not fit in memory. – MrSmith42 May 06 '16 at 07:33
  • Whether the data set without duplicates fits entirely into memory is beside the point. The question is about removing dupes from a data set that won't fit into memory. Anyway, there's a Wiki link if you're interested in the marked answer. – Beebunny May 06 '16 at 18:09
  • @Beebunny: My answer is only for the case that the data set without duplicates fits entirely into memory (as I mentioned in my answer). Whether this is the case for your problem is not mentioned in your question. If not, my answer may still be useful for someone else. – MrSmith42 May 09 '16 at 05:58
0

I think what you might be looking for, more specifically, is Bundle Sorting (which is also an external sorting algorithm). It is well suited for duplicate removal. An efficient algorithm can be found here: Efficient Bundle Sorting. Just putting this here for anyone who comes looking for a specific algorithm.

Shasak
  • 790
  • 8
  • 19