It sounds like you're averse to allocating any additional space. Nonetheless, a hash table is still the right solution for speed. Honestly, most hash table implementations for simple data such as integers are so overweight from their one-size-fits-all nature that I just roll my own depending on what I need. For relatively little work, it can turn slow code into fast code when you need it.
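For what it's worth, here's a minimal sketch of the kind of table I mean: fixed power-of-two capacity, linear probing, no resizing or deletion. It's only an illustration, not production code; the sentinel value and the hash constant are arbitrary choices of mine, and it assumes `INT32_MIN` never appears as a key and that you size the table well above the number of keys.

```cpp
#include <cstdint>
#include <vector>

// Minimal open-addressing map from 32-bit keys to 32-bit values:
// fixed power-of-two capacity, linear probing, no resizing or deletion.
constexpr int32_t kEmpty = INT32_MIN;  // sentinel meaning "slot unused"

struct IntMap {
    std::vector<int32_t>  keys;
    std::vector<uint32_t> vals;

    explicit IntMap(size_t pow2_capacity)
        : keys(pow2_capacity, kEmpty), vals(pow2_capacity, 0) {}

    size_t slot(int32_t key) const {
        // Knuth-style multiplicative hash; capacity must be a power of two.
        return (static_cast<uint32_t>(key) * 2654435761u) & (keys.size() - 1);
    }

    // Value for key, inserting a zero entry if the key is absent.
    uint32_t& operator[](int32_t key) {
        size_t i = slot(key);
        while (keys[i] != kEmpty && keys[i] != key)
            i = (i + 1) & (keys.size() - 1);  // probe the next slot
        keys[i] = key;
        return vals[i];
    }

    // Pointer to the value for key, or nullptr if the key is absent.
    uint32_t* find(int32_t key) {
        for (size_t i = slot(key); keys[i] != kEmpty; i = (i + 1) & (keys.size() - 1))
            if (keys[i] == key) return &vals[i];
        return nullptr;
    }
};
```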
Also, if your objection to hash tables is that they destroy order, then you can use them a little differently to get expected O(n) while maintaining order:
Create a hash table that maps each array element to a single 32-bit word: two bits used as a counting field (zero to three) and thirty bits used as an index into the array of elements. Unless you've got over a billion values in your array, thirty bits is enough.
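The exact bit layout is up to you; for concreteness, something like this (putting the count in the low two bits is my own choice):

```cpp
#include <cstdint>

// One 32-bit table value: low 2 bits hold the count (0..3),
// high 30 bits hold an index into the array (0 .. 2^30 - 1).
constexpr uint32_t pack(uint32_t count, uint32_t index) { return (index << 2) | (count & 3u); }
constexpr uint32_t count_of(uint32_t v) { return v & 3u; }
constexpr uint32_t index_of(uint32_t v) { return v >> 2; }
```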
Go through the elements in the array. If an element isn't in the table, insert it with the count field set to zero; it doesn't matter what the index portion holds at this point. If the element is in the table and the count field is zero, bump the count to one and store the current element index alongside it. If the count field is already one or greater, set it to two and leave the stored index untouched.
Go through the elements in the array again. Look up each element; if the current index matches the stored index and the associated count field is greater than zero, print the element.
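Putting both passes together in a rough sketch (using std::unordered_map in place of a hand-rolled table, with a made-up input array for illustration):

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

// Packed value: low 2 bits = count (0..3), high 30 bits = array index.
static uint32_t pack(uint32_t count, uint32_t index) { return (index << 2) | (count & 3u); }
static uint32_t count_of(uint32_t v) { return v & 3u; }
static uint32_t index_of(uint32_t v) { return v >> 2; }

int main() {
    std::vector<int> a = {4, 7, 4, 2, 7, 7, 9, 2};  // example input

    std::unordered_map<int, uint32_t> table;
    table.reserve(a.size());

    // Pass 1: build the packed count/index entries.
    for (uint32_t i = 0; i < a.size(); ++i) {
        auto it = table.find(a[i]);
        if (it == table.end()) {
            table.emplace(a[i], pack(0, 0));            // first sighting, index unused
        } else if (count_of(it->second) == 0) {
            it->second = pack(1, i);                     // second sighting, remember this index
        } else {
            it->second = pack(2, index_of(it->second));  // third or later, keep stored index
        }
    }

    // Pass 2: print each repeated element once, in order.
    for (uint32_t i = 0; i < a.size(); ++i) {
        uint32_t v = table.find(a[i])->second;  // every element was inserted in pass 1
        if (count_of(v) > 0 && index_of(v) == i)
            std::printf("%d\n", a[i]);
    }
    return 0;
}
```

For that example input it prints 4, 7 and 2: each repeated element exactly once, at the position where it first repeats.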
This should give you what you want, in the proper order, in O(n) time. But it uses hash tables, which you've said you don't want for reasons that aren't clear. I highly recommend you either accept a solution such as this one or explain the limitations, so that you'll get a more accurately targeted solution.