Good Procedure to remove duplicates

Question

I've a huge string(1GB) with SPACE delimiter, I'll convert it into Array[]. My string contains lots of duplicates. I've to sort the string and remove duplicates. I've made up 2 procedures and I'm not able to decide one among these two.

Procedure 1

I assume that sorting string is costly process, I wanted to remove duplicates using HashSet and then sort.

Procedure 2

I sort the Array and remove duplicates using formal procedure of comparing sorted Array with its previous value to next value and remove duplicates.

From my point of view, 1st procedure seems good. But I'm not aware if I run into any errors. Which one will be good..?

1GB _string_. You're kidding, right. If not, consider an alternate approach. — devnull, Apr 20 '14 at 18:11
maybe look at this? http://stackoverflow.com/questions/1532819/algorithm-efficient-way-to-remove-duplicate-integers-from-an-array. Though I'm not sure how efficient it will be with such a big string. — Sudo Andrew, Apr 20 '14 at 18:14
How you get this 1gb of string? You could do more work here and avoid to work with 1gb string.. — Marco Acierno, Apr 20 '14 at 18:15
use #intern to save space, or use a string pool based on a hashset, which would be procedure 1. — Thomas Jungblut, Apr 20 '14 at 18:15
@devnull There are lot of duplicates, may be more than 80% of data has duplicate values. — Prashanth, Apr 20 '14 at 18:45

score 2 · Accepted Answer · answered Apr 20 '14 at 18:21

Assuming memory is not an issue, the most efficient approach, performance-wise, is probably:

String s = someOneGbString();
String[] words = s.split("\\s+");
Set<String> noDupes = new HashSet<>();
Collections.addAll(noDupes, words);

And if you need it sorted:

Set<String> sorted = new TreeSet<> (noDupes);

Or with Java 8:

Set<String> sorted = Arrays.stream(s.split("\\s+"))
                           .sorted()
                           .collect(toSet());

score 1 · Answer 2 · answered Apr 20 '14 at 18:33

Case 1: Memory < ~1GB

You can use external merge sort. http://en.wikipedia.org/wiki/External_sorting#External_merge_sort

Case 2: Memory > ~1GB

Read the whole String. Split it into an array (String[]). Use in-place quicksort. Iterate over the array and check if sequential neighboring strings are the same or not. As substrings are not copies of the original String but simply refers to the memory location in the String pool, this will be space efficient.

Time Complexity: O(nlogn)

Case 3: Memory >> ~1GB

Do as others suggested. Use a TreeSet or a HashSet. For TreeSet, each insertion will be O(logn) so total is O(nlogn). However this will be less efficient than quicksort in terms of both time and space. HashSet is more complicated depending on the hash function. Under most circumstances, it will do OK, with O(n) time complexity.

Good Procedure to remove duplicates

Procedure 1

Procedure 2

2 Answers2