Find the missing and duplicate elements in an array in linear time and constant space

Question

You’re given an array of N 64 bit integers. N may be very large. You know that every integer 1..N appears once in the array, except there is one integer missing and one integer duplicated.

Write a linear time algorithm to find the missing and duplicated numbers. Further, your algorithm should run in small constant space and leave the array untouched.

Source: http://maxschireson.com/2011/04/23/want-a-job-working-on-mongodb-your-first-online-interview-is-in-this-post/

I think you'll find this helpful: http://stackoverflow.com/questions/3492302/easy-interview-question-got-harder-given-numbers-1-100-find-the-missing-number — Jim Mischel, Apr 23 '11 at 21:12
@JIm: Actually that is different. Here we get equations of the form sum a^k - sum b^k = s, while that has sum a^k + sum b^k = s, allowing us to use Newtonian Identies. It does not seem obvious if we can use those Newtonian Identies here too. In fact, this seems more relevant: http://stackoverflow.com/questions/5249985/finding-missing-elements-in-an-array — , Apr 24 '11 at 20:35
@Moron: and that's why I said I thought he'd find it helpful, rather than voting to close the question as a duplicate. — Jim Mischel, Apr 25 '11 at 03:19
@Jim: I was saying that it might not be as helpful as one would think. Wasn't talking about dupe or anything. Anyway... — , Apr 25 '11 at 04:16

score 38 · Accepted Answer · answered Apr 23 '11 at 21:38

If all numbers were present in the array the sum would be N(N+1)/2.

Determine the actual sum by summing up all numbers in the array in O(n), let this be Sum(Actual).

One number is missing, let this be j and one number is duplicated, let this be k. That means that

Sum(Actual) = N(N+1)/2 + k - j

derived from that

k = Sum(Actual) -N(N+1)/2 + j

Also we can calculate the sum of squares in the array, which would sum up to n³/3 + n²/2 + n/6 if all numbers were present.

Now we can calculate the actual sum of squares in O(n), let this be Sum(Actual Squares).

Sum(Actual Squares) =n³/3 + n²/2 + n/6 + k² - j²

Now we have two equations with which we can determine j and k.

Nice analysis. I generalized it to m – DavidC Apr 24 '11 at 02:57 — DavidC, Apr 24 '11 at 02:57
with 'N maybe very large', this needs bigInt support – Abhinav Gauniyal Apr 18 '18 at 09:53 — Abhinav Gauniyal, Apr 18 '18 at 09:53

score 31 · Answer 2 · edited May 23 '17 at 10:29

The XOR trick works in two passes with a read-only array.

This avoids the problem of possible integer overflows which the sum and sum of squares solution has.

Let the two numbers be x and y, one of which is the missing number and the other repeated.

XOR all the elements of the array, along with 1,2,...,N.

The result is w = x XOR y.

Now since x and y are distinct, w is non-zero.

Pick any non-zero bit of w. x and y differ in this bit. Say the position of the bit is k.

Now consider a split of the array (and the numbers 1,2,...,N) into two sets, based on whether the bit at position k is 0 or 1.

Now if we compute the XOR (separately) of the elements of the two sets, the result has to be x and y.

Since the criteria for splitting is just checking if a bit is set of not, we can compute the two XORs of the two sets by making another pass through the array and having two variables, each of which holds the XOR of the elements seen so far (and 1,2,...N), for each set. At the end, when we are done, those two variables will hold x and y.

Finding missing elements in an array which can be generalized to m appearing twice and m missing.
Find three numbers appeared only once which is about three missing.

Is there a theory/ book write up which can elaborate about this? — user248884, Mar 25 '18 at 04:29

score 6 · Answer 3 · edited May 23 '17 at 11:47

6

Using the basic idea from a related interview question you could do:

Sum up all the numbers (we shall call this S1) and their squares (S2)
Compute the expected sum of the numbers, without modifications, i.e. E1 = n*(n+1)/2 and E2 = n*(n+1)*(2n+1)/6
Now you know that E1 - S1 = d - m and E2 - S2 = d^2 - m^2, where d is the duplicated number and m the missing one.

Solve this system of equations and you'll find that:

m = 1/2 ((E2 - S2)/(E1 - S1) - (E1 - S1))
d = 1/2 ((E2 - S2)/(E1 - S1) + (E1 - S1)) // or even simpler: d = m + (E1 - S1)

.

$S1 = $S2 = 0;
foreach ($nums as $num) {
    $S1 += $num;
    $S2 += $num * $num;
}

$D1 = $n * ($n + 1) / 2                - $S1;
$D2 = $n * ($n + 1) * (2 * $n + 1) / 6 - $S2;

$m = 1/2 * ($D2/$D1 - $D1);
$d = 1/2 * ($D2/$D1 + $D1);

edited May 23 '17 at 11:47

Community

1
1

answered Apr 23 '11 at 21:38

NikiC

100,734
37
191
225

Except it's not really `O(n)` if the numbers are `1..n` instead of `1..100`. – IVlad Apr 23 '11 at 21:48
2

@IVlad: Why isn't it? Looping through `n` values is `O(n)` isn't it? And the other computations are `O(1)`. – NikiC Apr 23 '11 at 21:51
@nikic - because the result of adding `n` 64-bit numbers is not necessarily a 64-bit number, so you need to write code to handle such big integers. That code will not run in constant time. Squaring `n` is also not `O(1)` - notice that `n` has no upper limit. Not that your solution is wrong or anything - I don't think it's possible to do it under the OP's constraints, I'm just saying it's not linear. – IVlad Apr 23 '11 at 22:03
2

@IVlad: I see no problem. You need a *constant* amount of 191 bits for the square sum and 128 bits for the normal sum. – NikiC Apr 23 '11 at 22:13
@IVlad the act of adding or multipling (squaring) a 64 bit number is constant time. The result of such an operation is at most a 128 bit number, and the result of adding all possible 128 bit numbers together is at most a 256 bit number. There is a constant upperbound of a 256 bit number, and since the time it takes to operate on those numbers is related to the length of the numbers, for which there is a known upper bound, it is a constant operation. – corsiKa Apr 23 '11 at 22:13
I think you can eliminate the need for >64 bits entirely using modular arithmetic... – R.. GitHub STOP HELPING ICE Apr 25 '11 at 13:26

score 5 · Answer 4 · answered Aug 05 '15 at 20:30

Here is a Java implementation based on @Aryabhatta 's idea:
Input:[3 1 2 5 3]
Output:[3, 4]

public ArrayList<Integer> repeatedNumber(final List<Integer> A) {
    ArrayList<Integer> ret = new ArrayList<>();
    int xor = 0, x = 0, y = 0;
    for(int i=0; i<A.size(); i++) {
        xor ^= A.get(i);
    }
    for(int i=1; i<=A.size(); i++) {
        xor ^= i;
    }

    int setBit = xor & ~(xor-1);
    for(int i=0; i<A.size(); i++) {
        if((A.get(i) & setBit) != 0) {
            x ^= A.get(i);
        } else {
            y ^= A.get(i);
        }
    }
    for(int i=1; i<=A.size(); i++) {
        if((i & setBit) != 0) {
            x ^= i;
        } else {
            y ^= i;
        }
    }

    for(int i=0; i<A.size(); i++) {
        if(A.get(i) == x) {
            ret.add(x);
            ret.add(y);
            return ret;
        } 

        if(A.get(i) == y) {
            ret.add(y);
            ret.add(x);
            return ret;
        }
    }

    return ret;
}

score 3 · Answer 5 · edited May 23 '17 at 11:55

3

The solution proposed by BrokenGlass covers the case for two unknowns (corresponding to one duplicate number and one missing number), using two formulas:

sum1

and

sum2

The formulas yield the generalized harmonic number of orders n of -1 and -2, respectively. (Power series)

This solution is generalizable for 3 unknowns by including the value of generalized harmonic number of order n of -3.

sum3

To solve for m unknowns (duplicates and missing numbers), use m generalized harmonic numbers of orders n of -1 through -m.

Moron notes that this approach was discussed earlier in StackOverflow at Easy interview question got harder.

edited May 23 '17 at 11:55

Community

1
1

answered Apr 24 '11 at 02:42

DavidC

3,056
1
20
30

Harmonic numbers are of the form 1+ 1/2 + 1/3 + ... + 1/n. – Apr 24 '11 at 03:00
@Moron I'm referring to generalized harmonic numbers of negative orders. (see http://en.wikipedia.org/wiki/Harmonic_number#Generalized_harmonic_numbers ) – DavidC Apr 24 '11 at 03:09
Those seem to be only defined only for m >= 1, but what you say makes sense. Here is another reference: http://mathworld.wolfram.com/PowerSum.html. See equation 11. – Apr 24 '11 at 03:28
@Moron The generalized harmonic number allows for m<0. Try running `Table[HarmonicNumber[10, -m], {m, 1, 7}]` in Mathematica. Or `Sum[p^k, {p, 1, n}]`. Thanks for the link to PowerSum, which also seems to fit. – DavidC Apr 24 '11 at 03:45
@David: I agree with you. It is well defined for all m. I misread the wiki page. – Apr 24 '11 at 04:38
btw, your answer is lacking the use of using [Newton's Identities](http://en.wikipedia.org/wiki/Newton%27s_identities) to form a polynomial whose roots we seek. Also, the generalization has already been treated here: http://stackoverflow.com/questions/3492302/easy-interview-question-got-harder-given-numbers-1-100-find-the-missing-number – Apr 24 '11 at 05:39
@Moron Thanks. I'll make a reference to the earlier discussion you cited. – DavidC Apr 24 '11 at 11:51
@David: Actually I just realized, Newton's identities don't directly apply... As we get equations of the form \sum (x_i)^k - \sum (y_i)^k = P_k. Groebner basis would be one way to go, or you can generalize my answer here: http://stackoverflow.com/questions/5249985/finding-missing-elements-in-an-array/5251995#5251995 – Apr 24 '11 at 15:42

score 3 · Answer 6 · answered Apr 27 '11 at 13:28

Taking the leave the array untouched requirement literally (i.e. the array can be temporarily modified as long as it does not change in the end), a programming-oriented solution can be suggested.

I assume that array size N is much smaller than 2^64 which is an utterly unrealistic amount of memory. So we can safely assume that N < 2^P such that P << 64 (significantly smaller). In other words this means that all numbers in the array have some high bits unused. So let's just use the highest bit as a flag whether the index of that position has been seen in the array. The algorithm goes as follows:

 set HIGH = 2^63  // a number with only the highest bit set
 scan the array, for each number k do
   if array[k] < HIGH: array[k] = array[k] + HIGH // set the highest bit
   else: k is the duplicate
 for each i in 1..N do
   if array[i] < HIGH: i is missing
   else: array[i] = array[i] - HIGH // restore the original number

This is linear time and very little computation

Bharat Arya · Answer 7 · 2019-05-30T17:57:54.860

1

    long long int len = A.size();
    long long int sumOfN = (len * (len+1) ) /2, sumOfNsq = (len * (len +1) *(2*len +1) )/6;
    long long int missingNumber1=0, missingNumber2=0;

    for(int i=0;i<A.size(); i++){
       sumOfN -= (long long int)A[i];
       sumOfNsq -= (long long int)A[i]*(long long int)A[i];
    }

    missingno = (sumOfN + sumOfNsq/sumOfN)/2;
    reaptingNO = missingNumber1 - sumOfN;

edited May 30 '19 at 17:57

answered May 30 '19 at 17:27

Bharat Arya

119
1
8

score -2 · Answer 8 · answered Apr 23 '11 at 21:16

-2

Psuedo code assuming the set is sorted

missing = nil
duplicate = nil

for i = 0, i < set.size - 1, i += 1
  if set[i] == set[i + 1]
    duplicate = set[i]
  else if((set[i] + 1) != set[i+1])
    missing = set[i] + 1
  if missing != nil && duplicate != nil
    break

return (missing, duplicate)

answered Apr 23 '11 at 21:16

dan.simpson

1

9

Almost certainly you *cannot* assume the array is sorted. – BrokenGlass Apr 23 '11 at 21:19

Find the missing and duplicate elements in an array in linear time and constant space

8 Answers8

Linked

Related