
Suppose there is an array of elements with no duplicates except for one number,

e.g. 1, 2, 13, 4, 7, 11, 2, 6

How can the duplicate number be found efficiently? We can do it using a hash table (HT) in O(n) time and O(n) space, like below.

if(HT.contains(item)) -> this is the duplicate
else
    HT.add(item)
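
For concreteness, a minimal version of that idea in Java using java.util.HashSet (the class and method names below are just illustrative):

import java.util.HashSet;
import java.util.Set;

class HashDupFinder {
    // returns the duplicate, or null if every element is unique
    static Integer findDuplicate(int[] data) {
        Set<Integer> seen = new HashSet<>();
        for (int x : data) {
            if (!seen.add(x)) {   // add() returns false if x was already present
                return x;         // this is the duplicate
            }
        }
        return null;
    }
}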

Is there a better way in terms of both space and time complexity?

Note: this problem is not a duplicate of the two problems below, which are different.

If the integers are consecutive, the solution in this link can be used: how-to-find-a-duplicate-element-in-an-array-of-shuffled-consecutive-integers

If the array of n elements contains only elements from 0 to n-1, this link has a solution: Finding duplicates in O(n) time and O(1) space

  • These numbers are small enough (and all >= 0) to fit as bits in a single short integer. Not a hash, then, but a *set* -- and, being a bit set, you can efficiently test each new item. – Jongware Sep 13 '14 at 10:36
  • How large is the array, and what is the range of the items in the array? Is it an arbitrarily large array of integers? Or is it a small array of small values? – Jim Mischel Sep 13 '14 at 12:37
  • @JimMischel It is an arbitrarily large array and we can't afford to sort the array. – CRM Sep 13 '14 at 12:39
  • If you can't sort the array, then the hash table seems like the only reasonable solution. – Jim Mischel Sep 13 '14 at 12:47
  • @Falmarri: the question in your link restricts the input to 0 to n-1 only, which makes the problem easier. The one asked here is different. – CRM Sep 14 '14 at 08:37

2 Answers


I don't think you can do better than O(n) time complexity - in the worst case you're going to have to touch every element of the dataset to find the duplicate.

One way to improve the space consumption (at the cost of some bit twiddling and two passes over the dataset) is to use a Bloom filter. On the first pass, test each element against the filter: if the filter reports it as possibly seen before, treat it as a candidate duplicate, remove it from the dataset and add it to a hash table (if the Bloom filter is working well, only around 1% of the elements will be flagged as possible duplicates). Then make a second pass over the filtered dataset, testing each remaining element against the small hash table of candidates.

My code will be in Java since it's the language I'm most familiar with.

class DupFinder {
  // BloomFilter and HashTable are placeholders here (e.g. Guava's BloomFilter
  // and a java.util.HashSet<Integer>); any types with add/contains semantics will do
  BloomFilter filter = new BloomFilter();
  HashTable hashTable = new HashTable();
  int start = 0;

  Integer run(int[] dataset) {
    // first pass: collect the possible duplicates flagged by the filter
    for(int i = 0; i < dataset.length; i++) {
      if(filter.contains(dataset[i])) {
        // check if the element is already a known candidate, else record it
        if(hashTable.contains(dataset[i])) {
          return dataset[i]; // duplicate found
        } else {
          hashTable.add(dataset[i]);
        }

        // swap the candidate to the front so the second pass can skip it
        int temp = dataset[start];
        dataset[start] = dataset[i];
        dataset[i] = temp;
        start++;
      } else {
        filter.add(dataset[i]);
      }
    }

    // second pass: test the remaining elements against the small candidate table
    for(int i = start; i < dataset.length; i++) {
      if(hashTable.contains(dataset[i])) {
        return dataset[i]; // duplicate found
      }
    }
    return null; // no duplicate found
  }
}

An alternative to your hash table is to use a Radix Sort, a linear time sorting algorithm. Radix sort will have better worst-case performance (O(n) for radix sort, compared to O(n^2) for the hash table in the unlikely event that you run into a ridiculous number of collisions) but worse average-case performance (the hash table implementation will usually find the duplicate after scanning only half of the dataset, whereas the radix sort will always require multiple passes over the dataset). Radix Sort will also be marginally more efficient in terms of space consumption if you use a space efficient data structure for the buckets, e.g. a chunked list.
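
A rough sketch of that alternative, assuming non-negative ints (LSD radix sort with 8-bit digits, then a single scan for adjacent equal values; RadixDupFinder is just an illustrative name):

import java.util.Arrays;

class RadixDupFinder {
    // returns the duplicate, or null if every element is unique
    static Integer findDuplicate(int[] data) {
        int[] a = Arrays.copyOf(data, data.length);
        int[] buffer = new int[a.length];
        // LSD radix sort: four counting-sort passes of 8 bits each
        for (int shift = 0; shift < 32; shift += 8) {
            int[] counts = new int[257];
            for (int x : a) counts[((x >>> shift) & 0xFF) + 1]++;
            for (int d = 1; d < counts.length; d++) counts[d] += counts[d - 1];
            for (int x : a) buffer[counts[(x >>> shift) & 0xFF]++] = x;
            int[] tmp = a; a = buffer; buffer = tmp;
        }
        // after sorting, a duplicate must sit next to its twin
        for (int i = 1; i < a.length; i++) {
            if (a[i] == a[i - 1]) return a[i];
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(findDuplicate(new int[]{1, 2, 13, 4, 7, 11, 2, 6})); // prints 2
    }
}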

You can parallelize the hash table implementation without incurring too much synchronization overhead. Using t threads, each thread will process n/t elements of the dataset (e.g. if you have 32 elements in the dataset and 2 threads, then thread0 processes elements 0-15 and thread1 processes elements 16-31), placing each element in a bucket with index absoluteValue(x modulo t). Following this, each thread will be responsible for processing all elements with a given bucket index, e.g. thread0 will process all buckets with index 0. I'm using a BlockingQueue for synchronization - this allows a thread to call take() on the queue, causing the thread to remove the first element of the queue or else to block until an element becomes available; all other data structures are thread-local. You'll need to initialize the dupFinders variable so that a DupFinder instance appears in the same index of every DupFinder's dupFinders variable (e.g. thread0 always appears in the 0th index, thus ensuring that all elements in its BlockingQueue have absoluteValue(x modulo t) == 0).

import java.util.Stack;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.LinkedBlockingQueue;

class DupFinder implements Callable<Integer> {
  // fixed-size batch of ints, so whole chunks can be handed between threads cheaply
  private static class Chunk {
    int size = 0;
    int[] chunk = new int[64];

    boolean add(int x) {
      if(size < 64) {
        chunk[size] = x;
        size++;
        return true;
      } else return false;
    }
  }

  int t = ???; // number of threads
  private final int[] partialDataset; // this thread's n/t elements of the dataset
  private BlockingQueue<Stack<Chunk>> queue = new LinkedBlockingQueue<>();
  private DupFinder[] dupFinders = new DupFinder[t];
  @SuppressWarnings("unchecked")
  private Stack<Chunk>[] stacks = (Stack<Chunk>[]) new Stack[t];

  DupFinder(int[] partialDataset) {
    this.partialDataset = partialDataset;
    for(int i = 0; i < t; i++) {
      stacks[i] = new Stack<>();
    }
  }

  void add(Stack<Chunk> stack) {
    queue.add(stack);
  }

  public Integer call() throws InterruptedException {
    // partition this thread's elements by their value modulo t
    for(int i = 0; i < partialDataset.length; i++) {
      Stack<Chunk> tempStack = stacks[Math.abs(partialDataset[i] % t)];
      if(tempStack.isEmpty() || !tempStack.peek().add(partialDataset[i])) {
        Chunk chunk = new Chunk();
        chunk.add(partialDataset[i]);
        tempStack.push(chunk);
      }
    }

    // distribute the chunk stacks to the threads responsible for each bucket index
    for(int i = 0; i < t; i++) {
      dupFinders[i].add(stacks[i]);
    }

    HashTable hashTable = new HashTable(); // placeholder, e.g. a HashSet<Integer>
    for(int i = 0; i < t; i++) {
      // wait for a chunk stack to become available
      Stack<Chunk> tempStack = queue.take();
      while(!tempStack.isEmpty()) {
        Chunk tempChunk = tempStack.pop();
        for(int j = 0; j < tempChunk.size; j++) {
          if(hashTable.contains(tempChunk.chunk[j])) {
            return tempChunk.chunk[j]; // duplicate found
          } else {
            hashTable.add(tempChunk.chunk[j]);
          }
        }
      }
    }
    return null; // no duplicate found
  }
}
Zim-Zam O'Pootertoot

Operations on single bits are time-consuming (get word, get/set one bit, set word) compared to whole-word operations (get/set word).

If you know that MIN_VALUE >= 0, and you also know MAX_VALUE and it is sufficiently small, you can do something like Jongware suggested - a hash table, but not on bits: the hashed value of each element is simply the value itself.

#include <stdio.h>

#define MAX_VALUE (13 + 1) // +1 so we don't have to do -1 in the for loop

int main(void)
{
    int i;
    int array[] = { 1,2,13,4,7,11,2,6 };
    int array_size = sizeof(array) / sizeof(array[0]);

    short flags[MAX_VALUE] = { 0 }; // one counter slot per possible value
    for (i = 0; i < array_size; ++i) {
        if (++flags[ array[i] ] != 1) {
            printf("duplicated %d on %dth position\n", array[i], i);
        }
    }
    return 0;
}
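
For the example array this prints duplicated 2 on 6th position, since the second occurrence of 2 sits at index 6.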

And it also doesn't require computing a hash for each element.

Dawid