
I want to remove duplicated values in my data. I know this question comes up frequently on Stack Overflow, but my problem is a little different because I am handling a very large amount of data. Therefore execution time is my main concern.

As shown in the snippet below, I wrote some simple code for removing duplicated values.

// Suppose that the dataset is very huge so that
// multi-node resources should be necessary.    
String[] data = new String[10_000_000];

HashMap<String, String> uniqueItems = new HashMap<>();

for (int i = 0; i < data.length; i++) {
    if (uniqueItems.containsKey(data[i])) {
        uniqueItems.remove(data[i]);
        uniqueItems.put(data[i], "inserted");
    } else {
        uniqueItems.put(data[i], "inserted");
    }
}

However, I don't like it, because I suspect that other data structures or algorithms could remove the duplicates more efficiently than my code does.

So I am looking for better ways to quickly remove duplicated values when the data is large.
I would appreciate it if you could let me know the fastest way of removing duplicated values.

I am also wondering whether the number of duplicated values affects performance. For example, if 50% of the original data is duplicated, does the choice of the best algorithm and data structure change? If so, I want to find an approach that achieves good performance in the general case.

sclee1
  • Would you consider using a `Set`? – vikingsteve May 18 '17 at 12:25
  • See http://stackoverflow.com/questions/40717681/removing-duplicates-in-java-on-large-scale-data and http://stackoverflow.com/questions/16693408/java-optimize-hashset-for-large-scale-duplicate-detection for example. Or go for hadoop/map-reduce: http://stackoverflow.com/questions/38065737/how-to-remove-duplicate-values-using-mapreduce ... so how much time, again, did you spend on your research? – GhostCat May 18 '17 at 12:28
  • Why do you remove and then re-insert the duplicate items? – tobias_k May 18 '17 at 12:29
  • For very large data sets perhaps a modified Merge-Sort which deletes all duplicates as soon as they are found might work well. – rossum May 18 '17 at 12:30
  • Replace the whole thing with `Set uniques = new HashSet<>(Arrays.asList(data));`. – shmosel May 18 '17 at 22:46

2 Answers


Convert your uniqueItems to a HashSet<String> and your for loop to simply:

uniqueItems.add(data[i]);

If add returns true, then you have inserted a unique string; false means it was a duplicate.

Both algorithms should run in O(n) time in the best case, but using a HashMap when you don't care about the value (for a given key) is silly and wastes resources. A HashSet is a better fit for such cases.
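
For example, a minimal sketch of the whole pass with a HashSet might look like this (the helper name distinct and the initial capacity are just illustrative choices; pre-sizing only avoids repeated rehashing while the set grows to ~10 million entries):

import java.util.HashSet;
import java.util.Set;

// Sketch: collect the distinct values in a single pass.
static Set<String> distinct(String[] data) {
    // Pre-size the set (assumes the default 0.75 load factor).
    Set<String> uniqueItems = new HashSet<>(data.length * 4 / 3 + 1);
    for (String s : data) {
        uniqueItems.add(s); // returns false when s is a duplicate
    }
    return uniqueItems;
}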

You could also try a TreeSet<String> to see what works best for your particular dataset. It will probably be worse, given JDK 8's new HashSet implementation: overcrowded buckets are automatically stored as mini tree sets, providing competitive performance even when the hashing function behaves badly. (That optimization is only possible for Comparable types such as String.)
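
If you do experiment with a TreeSet, note that it also hands the unique values back in sorted order; a small sketch for comparison (same single pass, O(n log n) overall, and it assumes natural String ordering with no nulls):

import java.util.Set;
import java.util.TreeSet;

// Sketch: same loop, but the resulting set iterates in sorted order.
static Set<String> distinctSorted(String[] data) {
    Set<String> uniqueItems = new TreeSet<>();
    for (String s : data) {
        uniqueItems.add(s); // O(log n) per add
    }
    return uniqueItems;
}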


Brute force searching arrays. In a simple, array-based algorithm, where you search the entire array before inserting each element, you would get very bad O(n²) performance.

As such, you might be tempted to sort your data first, placing duplicated elements near each other. This buys you faster O(n log n) performance, but that is still behind the HashMap/HashSet version in the general case.
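
For reference, a sort-then-compact version might look like the sketch below; it works on a copy of the input and assumes the array holds no nulls (Arrays.sort would throw otherwise):

import java.util.Arrays;

// Sketch: sort first (O(n log n)), then keep only the first element
// of each run of equal neighbours in a single O(n) pass.
static String[] distinctBySorting(String[] data) {
    String[] sorted = data.clone();
    Arrays.sort(sorted);
    int unique = 0;
    for (int i = 0; i < sorted.length; i++) {
        if (i == 0 || !sorted[i].equals(sorted[i - 1])) {
            sorted[unique++] = sorted[i];
        }
    }
    return Arrays.copyOf(sorted, unique);
}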


Linear is the theoretical best. You can't detect all duplicates without visiting each element at least once, so the current O(n) time complexity is really the best you can do here.

Of course, you can always try shaving down some of the hidden constants in the Big O notation, but you won't arrive at an asymptotically better algorithm.

Paulo Mattos
  • Is HashSet better than HashMap? I didn't get it. – sclee1 May 18 '17 at 12:35
  • @sclee1 Both share the same hash table algorithm. But if you really don't care about the stored value (for each key), then using a HashMap is silly; you are better off using a Set instead. – Paulo Mattos May 18 '17 at 12:38
  • Thanks for your comment. Why does your solution give O(n) only if some conditions are satisfied? (Because you say I need luck.) – sclee1 May 18 '17 at 12:53
  • @sclee1 It depends on how your data is stored inside the ultra fast hashmap. Please see [this answer](http://stackoverflow.com/a/4553642) for further considerations. – Paulo Mattos May 18 '17 at 13:00
  • Why do you think that using a HashMap had O(n²)? – tobias_k May 18 '17 at 13:58
  • @tobias_k you are right mate! I somehow thought he was using a simple array based algorithm. Running low on coffee here... :) – Paulo Mattos May 18 '17 at 14:04
  • HashSet is costly in terms of memory consumption. It could be OK for a small collection, but for 100 GB+ lists that could be prohibitive. Better to sort the array using QuickSort or MergeSort in O(N log N) time, then remove duplicate items in O(n) time. – WebViewer Dec 08 '22 at 13:52

In your example the data[i] values are used as keys in the HashMap uniqueItems.

A HashMap will always have unique keys. An existing key's value is simply overwritten by a put() operation. You don't need containsKey() to add a new element.

Why do you remove and re-insert an existing key?
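
For example, keeping your data and uniqueItems exactly as they are, the whole loop can be reduced to a single put() per element:

for (int i = 0; i < data.length; i++) {
    // put() alone is enough: a repeated key just has its value replaced
    uniqueItems.put(data[i], "inserted");
}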

Norbert