11

I had an interview and there was the following question:

Find unique numbers from sorted array in less than O(n) time.

Ex: 1 1 1 5 5 5 9 10 10
Output: 1 5 9 10

I gave the solution but that was of O(n).

Edit: Sorted array size is approx 20 billion and unique numbers are approx 1000.

halfer
Deepu--Java

5 Answers

21

Divide and conquer:

  • Look at the first and last element of a sorted sequence (the initial sequence is data[0]..data[data.length-1]).
  • If both are equal, the only distinct element in the sequence is the first (no matter how long the sequence is).
  • If they are different, divide the sequence in two and repeat for each subsequence.

Solves in O(log n) in the average case, and O(n) only in the worst case (when each element is different).

Java code:

public static List<Integer> findUniqueNumbers(int[] data) {
    List<Integer> result = new LinkedList<Integer>();
    findUniqueNumbers(data, 0, data.length - 1, result, false);
    return result;
}

private static void findUniqueNumbers(int[] data, int i1, int i2, List<Integer> result, boolean skipFirst) {

    int a = data[i1];
    int b = data[i2];

    // homogeneous sequence a...a: a is the only value in this range
    if (a == b) {
        if (!skipFirst) {
            result.add(a);
        }
    }
    else {
        // divide & conquer (overflow-safe midpoint)
        int i3 = i1 + (i2 - i1) / 2;
        findUniqueNumbers(data, i1, i3, result, skipFirst);
        // skip the first element of the right half if it continues the left half's last run
        findUniqueNumbers(data, i3 + 1, i2, result, data[i3] == data[i3 + 1]);
    }
}
Peter Walser
  • O(log n) is not the average case. It's O(log n) (or better) only when there's a relatively large amount of duplication. In the general case, it's O(n). – Jim Mischel Nov 16 '14 at 19:30
  • You're right, but this solution *can* find the unique numbers in less than O(n), as it doesn't necessarily need to look at all the numbers. Having no duplication is the worst case, not the average case. – Peter Walser Nov 16 '14 at 20:23
  • @PeterWalser I think your solution is better than the others and fits my edited question. Before accepting your answer let me check other inputs. Thanks. – Deepu--Java Nov 17 '14 at 05:24
  • @PeterWalser: we know very little about the distribution of the input. you cannot assert what the average case is or isn't. – Karoly Horvath Nov 22 '14 at 11:37
  • what about this case [1,2]? – Manish Kasera Jan 15 '15 at 20:00
  • I think as per the question asked by the OP (20 billion numbers and only 1000 unique), this method works pretty well and retrieves the unique elements in O(log n) time. In the case of all-unique elements, no algorithm is known to solve it in less than O(n) anyway. – gaurav jain Aug 17 '15 at 17:39
16

I don't think it can be done in less than O(n). Take the case where the array contains 1 2 3 4 5: in order to get the correct output, each element of the array would have to be looked at, hence O(n).

DanielGibbs
  • I agree with you and I also gave the same answer but he told me that it's possible. That's why I am looking for the answer here because I haven't got yet how it's possible. – Deepu--Java Nov 16 '14 at 14:53
  • It's possible that the interviewer understood O(n) differently, or counted library functions as constant time. – DanielGibbs Nov 16 '14 at 14:55
  • You can make complexity as a function of something else (eg. distinct elements in the array). See my answer below. – ElKamina Nov 16 '14 at 19:56
5

If your sorted array of size n has m distinct elements, you can do it in O(m log n).

Note that this is going to be efficient when m << n (e.g. m = 2 and n = 100).

Algorithm:

Initialization: Current element y = first element x[0]

Step 1: Do a binary search for the last occurrence of y in x (can be done in O(log n) time). Let its index be i.

Step 2: Set y = x[i+1] and go to step 1.

Edit: In cases where m = O(n), this algorithm is going to perform badly. To alleviate this, you can run it in parallel with the regular O(n) algorithm; the meta-algorithm stops when either of the two completes.
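The two steps above can be sketched in Java; a minimal sketch, with class and method names of my own choosing (not from the answer):

```java
import java.util.ArrayList;
import java.util.List;

public class DistinctFinder {

    // Collect the distinct values of a sorted array: repeatedly
    // binary-search for the last occurrence of the current value,
    // then jump just past that run. O(m log n) for m distinct values.
    public static List<Integer> distinct(int[] x) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        while (i < x.length) {
            int y = x[i];
            result.add(y);
            i = lastIndexOf(x, y, i, x.length - 1) + 1; // jump past the run of y
        }
        return result;
    }

    // Binary search for the last index in [lo, hi] holding the value y.
    // Precondition: x is sorted and x[lo] == y.
    private static int lastIndexOf(int[] x, int y, int lo, int hi) {
        while (lo < hi) {
            int mid = lo + (hi - lo + 1) / 2; // round up so the loop terminates
            if (x[mid] == y) {
                lo = mid;     // the run of y extends at least to mid
            } else {
                hi = mid - 1; // the run ends before mid
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        int[] data = {1, 1, 1, 5, 5, 5, 9, 10, 10};
        System.out.println(distinct(data)); // [1, 5, 9, 10]
    }
}
```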

Shaishav Jogani
ElKamina
0

Since the data consists of integers, there is a finite number of distinct values that can occur between any two values. So start by looking at the first and last value in the array. If a[length-1] - a[0] < length - 1, there must be some repeated values. Put a[0] and a[length-1] into a constant-access-time container such as a hash set. If the two values are equal, you know that there is only one unique value in the array and you are done. Since the array is sorted, if the two values are different, you can look at the middle element next. If the middle element is already in the set of values, you know that you can skip the whole left part of the array and only analyze the right part recursively. Otherwise, analyze both the left and right parts recursively.

Depending on the data in the array you will be able to get the set of all unique values in a different number of operations. You get them in constant time O(1) if all the values are the same since you will know it after only checking the first and last element. If there are "relatively few" unique values, your complexity will be close to O(log N) because after each partition you will "quite often" be able to throw away at least one half of the analyzed sub-array. If the values are all unique and a[length-1] - a[0] = length - 1, you can also "define" the set in constant time because they have to be consecutive numbers from a[0] to a[length-1]. However, in order to actually list them, you will have to output each number, and there are N of them.

Perhaps someone can provide a more formal analysis, but my estimate is that this algorithm is roughly linear in the number of unique values rather than the size of the array. This means that if there are few unique values, you can get them in few operations even for a huge array (e.g. in constant time, regardless of array size, if there is only one unique value). Since the number of unique values is no greater than the size of the array, I claim that this makes the algorithm "better than O(N)" (or, strictly: "not worse than O(N) and better in many cases").
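The recursive idea above can be sketched in Java. This is a minimal sketch, not the answerer's code: class and method names are mine, and the "already in the set" pruning is expressed here as an equal-endpoints check, which skips the same homogeneous sub-ranges on sorted data:

```java
import java.util.SortedSet;
import java.util.TreeSet;

public class UniqueViaSet {

    // Collect the distinct values of a sorted array into a set,
    // recursing into halves and stopping early on any sub-range
    // whose two endpoints hold the same value.
    public static SortedSet<Integer> unique(int[] a) {
        SortedSet<Integer> seen = new TreeSet<>();
        if (a.length > 0) {
            collect(a, 0, a.length - 1, seen);
        }
        return seen;
    }

    private static void collect(int[] a, int lo, int hi, SortedSet<Integer> seen) {
        seen.add(a[lo]);
        seen.add(a[hi]);
        // Equal endpoints: the whole sorted range is one repeated value,
        // so the entire sub-range can be skipped without scanning it.
        if (a[lo] == a[hi] || lo + 1 >= hi) {
            return;
        }
        int mid = lo + (hi - lo) / 2; // overflow-safe midpoint
        collect(a, lo, mid, seen);
        collect(a, mid, hi, seen);
    }

    public static void main(String[] args) {
        int[] data = {1, 1, 1, 5, 5, 5, 9, 10, 10};
        System.out.println(unique(data)); // [1, 5, 9, 10]
    }
}
```

As the answer notes, a range that is one long run costs only a constant number of comparisons, so the work grows with the number of distinct values rather than the array length.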

Michał Kosmulski
  • It seems you gave the solution for sequential entries but it can be 1 55 55 1000, like this also. – Deepu--Java Nov 16 '14 at 17:20
  • @DeepakTiwari The original question states that the array is sorted – Michał Kosmulski Nov 16 '14 at 17:48
  • @DeepakTiwari Yes, it's true that there is no gain in the case of 1, 55, 55, 1000 but we're looking at asymptotic behavior. If the number "55" was repeated 100 times in this sequence instead of just two, the complete set of values would require much fewer operations than 102, the length of the sequence. – Michał Kosmulski Nov 16 '14 at 18:53
0
import java.util.*;

/**
 * remove duplicates from a sorted array in average O(log n), worst O(n)
 * @author XXX
 */
public class UniqueValue {
    public static void main(String[] args) {
        int[] test = {-1, -1, -1, -1, 0, 0, 0, 0, 2, 3, 4, 5, 5, 6, 7, 8};
        UniqueValue u = new UniqueValue();
        System.out.println(u.getUniqueValues(test, 0, test.length - 1));
    }

    // i must be the start index, j the end index
    public List<Integer> getUniqueValues(int[] array, int i, int j) {
        if (array == null || array.length == 0) {
            return new ArrayList<Integer>();
        }
        List<Integer> result = new ArrayList<>();
        if (array[i] == array[j]) {
            // homogeneous range: a single unique value
            result.add(array[i]);
        } else {
            int mid = i + (j - i) / 2; // overflow-safe midpoint
            result.addAll(getUniqueValues(array, i, mid));

            // skip the rest of array[mid]'s run so the same value
            // is not reported by both halves
            int next = mid;
            while (next < j && array[next] == array[next + 1]) {
                next++;
            }
            if (next < j) {
                result.addAll(getUniqueValues(array, next + 1, j));
            }
        }
        return result;
    }
}
  • It is an interesting idea to use divide and conquer on this problem. My code is ready to run. The divide step is a little tricky: you have to avoid the same element appearing on both sides. – Qiang Dai Aug 24 '16 at 05:27