
EDIT: For anyone new to this question, I have posted an answer clarifying what was going on. The accepted answer is the one I feel best answers my question as originally posted, but for further details please see my answer.

NOTE: This problem was originally pseudocode and used lists. I have adapted it to Java and arrays. So while I'd love to see any solutions that use Java-specific tricks (or tricks in any language for that matter!), just remember that the original problem is language-independent.

The Problem

Let's say that there are two unsorted integer arrays a and b, with element repetition allowed. They are identical (with respect to contained elements) except one of the arrays has an extra element. As an example:

int[] a = {6, 5, 6, 3, 4, 2};
int[] b = {5, 7, 6, 6, 2, 3, 4};

Design an algorithm that takes as input these two arrays and outputs the single unique integer (in the above case, 7).

The Solution (So Far)

I came up with this:

public static int getUniqueElement(int[] a, int[] b) {
    int ret = 0;
    for (int i = 0; i < a.length; i++) {
        ret ^= a[i];
    }
    for (int i = 0; i < b.length; i++) {
        ret ^= b[i];
    }
    return ret;
}

The "official" solution presented in class:

public static int getUniqueElement(int[] a, int[] b) {
    int ret = 0;
    for (int i = 0; i < a.length; i++) {
        ret += a[i];
    }
    for (int i = 0; i < b.length; i++) {
        ret -= b[i];
    }
    return Math.abs(ret);
}

So, both are conceptually doing the same thing. And given that a is of length m and b is of length n, both solutions have a running time of O(m + n).
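Here's a quick sanity check (scaffolding only, not part of either solution; both versions should print 7 for the example arrays):

public static void main(String[] args) {
    int[] a = {6, 5, 6, 3, 4, 2};
    int[] b = {5, 7, 6, 6, 2, 3, 4};
    // Either implementation above should print 7 for this input.
    System.out.println(getUniqueElement(a, b));
}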

The Question

I later got to talking with my teacher and he hinted that there was an even faster way of doing it. Honestly I don't see how; to find out whether an element is unique it seems you'd have to at least look at every element. And that's at least O(m + n)...right?

So is there a faster way? And if so, what is it?

William Gaul
  • Using the official rules for calculating "order" it would seem impossible, on the face of it, to do any better than O(m+n). One could combine the two loops into one (do for the length of the shorter, then one external "iteration" for the longer), but the same number of array indexing operations, etc, would be performed -- only a bit of loop overhead would be saved. – Hot Licks Oct 06 '13 at 00:04
  • Please try and bug your teacher to tell you the faster solution, then tell us - I really don't see how you could be faster than looking at each element exactly once. – G. Bach Oct 06 '13 at 00:32
  • I can't think of anything faster than your solution. o_O....Granted I can't prove your solution is the fastest but honestly I think you have the fastest way to do it without cheating (like pre-storing part of the answer or storing in memory the underlying "sets" behind the arrays). So yes I would like to second @G.Bach 's comment. Please ask your teacher to give a "faster" solution that doesn't cheat by precomputing stuff. EDIT: Hot Licks made a good point that you can just condense the loops but it's still essentially the same algorithm. – Shashank Oct 06 '13 at 02:44
  • The teacher's solution doesn't work if the arrays contain negative numbers. – Paul Hankin Oct 06 '13 at 06:33
  • I really doubt that there is any way to improve the algorithmic complexity. Since `m = n + 1` then `O(n+m) --> O(2n+1) --> O(n)`. Since `n` is the input length in Big-O notation, algorithms cannot have a complexity less than `O(n)` unless they have some pre-conditioned input or data structure to work with. On the other hand, it may well be possible to optimize or improve on the *code*-efficiency, though I think that your approach is probably near-optimal. – RBarryYoung Oct 06 '13 at 16:48
  • Hmm I was hoping there was some simple way I was just overlooking...guess not. Will ask tomorrow! – William Gaul Oct 06 '13 at 19:03
  • @Anonymous Never noticed that, nice catch! Since it was never specified whether the integers were negative or positive I'm guessing that's a bug. – William Gaul Oct 06 '13 at 19:04
  • repetition is allowed, in an un-ordered list? Also note that if you sort the 2 lists; the solution becomes simple. – Khaled.K Oct 09 '13 at 07:58
  • @HotLicks: Actually making two loops out of this can possibly be the fastest way if the cpu supports SIMD instructions (http://en.wikipedia.org/wiki/SIMD) However, I am not up to date on how good java bytecode supports is on this. – Jens Timmerman Oct 09 '13 at 08:13

9 Answers


This is probably the fastest you can do it in Java, using Hot Licks' suggestion in the comments. It makes the assumption that b.length == a.length + 1, so b is the larger array with the extra "unique" element.

public static int getUniqueElement(int[] a, int[] b) {
    int ret = 0;
    int i;
    // XOR the paired elements of both arrays in one pass;
    // matching values cancel out.
    for (i = 0; i < a.length; i++) {
        ret = ret ^ a[i] ^ b[i];
    }
    // Fold in b's extra (last) element; whatever survives is unique.
    return ret ^ b[i];
}

Even if the assumption cannot be made, you can easily expand it to include the case where either a or b can be the larger array with the unique element. It's still O(m+n) though and only loop/assignment overhead is reduced.
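For instance, a hypothetical generalization (my sketch, not part of the original answer) that works whichever array is longer:

public static int getUniqueElementEitherWay(int[] a, int[] b) {
    // Decide which array holds the extra element.
    int[] shorter = (a.length < b.length) ? a : b;
    int[] longer = (a.length < b.length) ? b : a;
    int ret = 0;
    int i;
    // XOR the shared prefix pairwise; matching values cancel.
    for (i = 0; i < shorter.length; i++) {
        ret = ret ^ shorter[i] ^ longer[i];
    }
    // Fold in the longer array's single extra element.
    return ret ^ longer[i];
}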

Edit:

Due to details of language implementation, this is still (surprisingly) the fastest way to do it in CPython.

def getUniqueElement1(A, B):
    ret = 0
    for a in A: ret = ret ^ a
    for b in B: ret = ret ^ b
    return ret

I have tested this with the timeit module and found some interesting results. It turns out that the longhand ret = ret ^ a is indeed faster in Python than the shorthand ret ^= a. Also, iterating over the elements of a list is much, much faster than iterating over the indexes and then doing subscript operations in Python. That is why this code is much faster than my previous method, where I tried to copy Java.

I guess the moral of the story is that there is no correct answer because the question is bogus anyways. As the OP noted in another answer below, it turns out you can't really go any faster than O(m+n) on this and his teacher was just pulling his leg. Thus the problem reduces to finding the fastest way to iterate over all elements in the two arrays and accumulating the XOR of all of them. And this means it's entirely dependent on language implementation, and you have to do some testing and playing around to get the true "fastest" solution in whatever implementation you are using, because the overall algorithm will not change.

Shashank
  • Of course, this has the same number of array indexing ops and the same number of `^` ops as the original. Only loop overhead is reduced. – Hot Licks Oct 06 '13 at 02:59
  • But 50% less assignings. – Serge Seredenko Oct 06 '13 at 03:08
  • +1 I (thought of this too and) think it would be fastest. The important thing about using XOR over addition is you don't have to cater for integer overflow (in case the elements are large numbers). You may find the long hand `ret = ret ^ A[i] ^ B[i];` faster. The two are not exactly equivalent. – Bohemian Oct 06 '13 at 04:09
  • I'm liking this answer best so far. It's only a small improvement but even that's something. Out of curiosity @Bohemian what is the advantage of the longhand? I thought ^= expands out to the longhand anyway when interpreted/compiled. – William Gaul Oct 06 '13 at 19:16
  • The difference is casting. The shorthand expands out with a cast to the type of the variable. It allows you to use shorthand with mixed types, even larger ones (eg long) without an explicit cast. Although the compiler would probably (hopefully) omit the cast because the types are the same. I haven't checked the byte code. But the longhand is not going to be slower. – Bohemian Oct 06 '13 at 20:33
  • In python, `return reduce (lambda x, y: x ^ y, A + B, 0)` should be enough. – thefourtheye Nov 19 '13 at 03:45

Alright here we go...apologies to anyone expecting a faster solution. It turns out my teacher was having a little fun with me and I completely missed the point of what he was saying.

I should begin by clarifying what I meant by:

he hinted that there was an even faster way of doing it

The gist of our conversation was this: he said that my XOR approach was interesting, and we talked for a while about how I arrived at my solution. He asked me whether I thought my solution was optimal. I said I did (for the reasons I mentioned in my question). Then he asked me, "Are you sure?" with a look on his face I can only describe as "smug". I was hesitant but said yeah. He asked me if I could think of a better way to do it. I was pretty much like, "You mean there's a faster way?" but instead of giving me a straight answer he told me to think about it. I said I would.

So I thought about it, sure that my teacher knew something I didn't. And after not coming up with anything for a day, I came here.

What my teacher actually wanted me to do was defend my solution as being optimal, not try to find a better solution. As he put it: creating a nice algorithm is the easy part, the hard part is proving it works (and that it's the best). He thought it was quite funny that I spent so much time in Find-A-Better-Way Land instead of working out a simple proof of the O(n) bound, which would have taken considerably less time (we ended up doing so; see below if you're interested).

So I guess, big lesson learned here. I'll be accepting Shashank Gupta's answer because I think that it does manage to answer the original question, even though the question was flawed.

I'll leave you guys with a neat little Python one-liner I found while typing the proof. It's not any more efficient but I like it:

def getUniqueElement(a, b):
    # Python 2; in Python 3 you need: from functools import reduce
    return reduce(lambda x, y: x ^ y, a + b)
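The same one-liner spirit is possible in Java as well; a sketch, assuming Java 8+ streams (which postdate this question):

import java.util.Arrays;
import java.util.stream.IntStream;

public static int getUniqueElementStreams(int[] a, int[] b) {
    // XOR-reduce the concatenation of both arrays; pairs cancel out.
    return IntStream.concat(Arrays.stream(a), Arrays.stream(b))
                    .reduce(0, (x, y) -> x ^ y);
}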

A Very Informal "Proof"

Let's start with the original two arrays from the question, a and b:

int[] a = {6, 5, 6, 3, 4, 2};
int[] b = {5, 7, 6, 6, 2, 3, 4};

We'll say here that the shorter array has length n; then the longer array must have length n + 1. The first step to proving linear complexity is to concatenate the arrays into a third array (we'll call it c):

int[] c = {6, 5, 6, 3, 4, 2, 5, 7, 6, 6, 2, 3, 4};

which has length 2n + 1. Why do this? Well, now we have another problem entirely: finding the element that occurs an odd number of times in c (from here on "odd number of times" and "unique" are taken to mean the same thing). This is actually a pretty popular interview question and is apparently where my teacher got the idea for his problem, so now my question has some practical significance. Hooray!
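In code, the reduced problem is solved by exactly the XOR pass from the question, just phrased over c (a sketch):

public static int getOddOccurrence(int[] c) {
    int ret = 0;
    for (int x : c) {
        ret ^= x; // values occurring an even number of times cancel out
    }
    return ret;
}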

Let's assume there is an algorithm faster than O(n), such as O(log n). What this means is that it will only access some of the elements of c. For example, an O(log n) algorithm might only have to check log(13) ~ 4 of the elements in our example array to determine the unique element. Our question is, is this possible?

First let's see if we can get away with removing any of the elements (by "removing" I mean not having to access it). How about if we remove 2 elements, so that our algorithm only checks a subarray of c with length 2n - 1? This is still linear complexity, but if we can do that then maybe we can improve upon it even further.

So, let's choose two elements of c completely at random to remove. There are actually several things that could happen here, which I'll summarize into cases:

// Case 1: Remove two identical elements
{6, 5, 6, 3, 4, 2, 5, 7, 2, 3, 4};

// Case 2: Remove the unique element and one other element
{6, 6, 3, 4, 2, 5, 6, 6, 2, 3, 4};

// Case 3: Remove two different elements, neither of which are unique
{6, 5, 6, 4, 2, 5, 7, 6, 6, 3, 4};

What does our array now look like? In the first case, 7 is still the unique element. In the second case there is a new unique element, 5. And in the third case there are now 3 unique elements...yeah it's a total mess there.

Now our question becomes: can we determine the unique element of c just by looking at this subarray? In the first case we see that 7 is the unique element of the subarray, but we can't be sure it is also the unique element of c; the two removed elements could have just as well been 7 and 1. A similar argument applies for the second case. In case 3, with 3 unique elements we have no way of telling which two are non-unique in c.

It becomes clear that even with 2n - 1 accesses, there is just not enough information to solve the problem. And so the optimal solution is a linear one.

Of course, a real proof would use induction and not use proof-by-example, but I'll leave that to someone else :)

William Gaul
  • Nice job with the `reduce` solution. :) One-line solutions are always neat to look at. – Shashank Oct 08 '13 at 22:22
  • If one of the arrays has a unique value, you are looking for the element that occurs exactly once, not just an odd number of times. The fastest method is the member of the set of all O(n) methods which has the smallest coefficient factor. Now there is the possibility that you could find a case that was O(n-1). See if you can discover such a case, or set of cases. :) – Fred Mitchell Oct 08 '13 at 22:27

You can store the count of each value in a collection such as an array or hash map, which is O(n). Then you can check the values of the other collection against those counts and stop as soon as you find a mismatch. This could mean you only search half the second array on average.
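A minimal sketch of this idea (my code, assuming b holds the extra element): count the first array into a map, then consume counts from the second and stop at the first value that runs out.

import java.util.HashMap;
import java.util.Map;

public static int getUniqueByCounting(int[] a, int[] b) {
    Map<Integer, Integer> counts = new HashMap<>();
    for (int x : a) {
        counts.merge(x, 1, Integer::sum); // count occurrences in a
    }
    for (int x : b) {
        Integer c = counts.get(x);
        if (c == null || c == 0) {
            return x; // x occurs more often in b: it's the extra element
        }
        counts.put(x, c - 1); // consume one matched occurrence
    }
    throw new IllegalArgumentException("arrays contain the same elements");
}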

Peter Lawrey
  • It's worth noting that the overhead of a (hash) map would probably far outweigh the benefit (we know `abs(m-n) = 1`, and `n` map insertions is most likely slower than `2n` arithmetic operations). – Bernhard Barker Oct 06 '13 at 00:00
  • Of course, storing the count requires an array as large as the maximum value of the source array elements, and a hash map is rarely a true O(n) for insertions. – Hot Licks Oct 06 '13 at 00:06
  • Umm...this requires structural modification with O(m+n) overhead anyways. It's not really a solution. I'm fairly certain you can't go better than O(m+n) on this. – Shashank Oct 06 '13 at 02:42

This is a little bit faster:

public static int getUniqueElement(int[] a, int[] b) {
    int ret = 0;
    int i;
    // Accumulate the pairwise differences in a single pass
    // (assumes b.length == a.length + 1).
    for (i = 0; i < a.length; i++) {
        ret += (a[i] - b[i]);
    }
    // Subtract b's extra element; like the "official" solution, the
    // Math.abs at the end assumes the values are non-negative.
    return Math.abs(ret - b[i]);
}

It's O(m), but the order doesn't tell the whole story. The loop part of the "official" solution has about 3 * m + 3 * n operations, and the slightly faster solution has 4 * m.

(Counting the loop "i++" and "i < a.length" as one operation each).

-Al.

A. I. Breveleri

Assuming only one element was added, and the arrays were identical to start with, you could hit O(log(base 2) n).

The rationale is that any array can be binary searched in O(log n). Except that in this case you are not searching for a value in an ordered array; you are searching for the first non-matching element. In such a circumstance a[n] == b[n] means that you are too low, and a[n] != b[n] means that you might be too high, unless a[n-1] == b[n-1].

The rest is basic binary search. Check the middle element, decide which division must have the answer, and do a sub-search on that division.
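For what it's worth, a sketch of that search (my code, not from the answer). Besides the same-order assumption, it also needs the elements to be distinct: with duplicates, positions past the insertion point can still match, so the too-low/too-high test stops being monotonic.

public static int getUniqueElementBinary(int[] a, int[] b) {
    // Assumes b is a with one element inserted, same order, all distinct.
    int lo = 0, hi = a.length;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (a[mid] == b[mid]) {
            lo = mid + 1; // arrays still agree: insertion point is later
        } else {
            hi = mid;     // arrays diverge: insertion point is at or before mid
        }
    }
    return b[lo];         // first position where the arrays diverge
}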

Edwin Buck
  • I think that the example given invalidates your assumption that array elements have the same order apart from the additional one. – Stephen C Oct 06 '13 at 04:54
  • @StephenC Thank you for the observation. I agree that the example doesn't fit the algorithm; however, if the professor really believes that there is a much faster way, then perhaps the professor's example differs from the student's example. – Edwin Buck Oct 06 '13 at 16:28
  • My belief is **EITHER** that the "teacher" is someone who doesn't have much real world understanding of tuning Java code, **OR** that the OP has misunderstood what the teacher has said ... and "much faster" is really "possibly a bit faster". Given the entire context (the problem, the constraints, the sample solution) it really doesn't make sense for the inputs to be pre-sorted. – Stephen C Oct 07 '13 at 00:07

Let's say that there are two unsorted integer arrays a and b, with element repetition allowed. They are identical (with respect to contained elements) except one of the arrays has an extra element ..

You may note that I emphasised two points in your original question, and I'm adding an extra assumption that the values are non-zero.

In C#, you can do this:

// The values become the dimensions of two multi-dimensional arrays,
// so Length is the product of all values in each series.
int[, , , , ,] a = new int[6, 5, 6, 3, 4, 2];
int[, , , , , ,] b = new int[5, 7, 6, 6, 2, 3, 4];
Console.WriteLine(b.Length/a.Length);  // prints 7

See? Whatever the extra element is, you will always know it by simply dividing their lengths.

With these statements, we are not storing the given series of integers as values to arrays, but as their dimensions.

Whatever the shorter series of integers is, the longer one has only one extra integer. So no matter the order of the integers, without the extra one the total sizes of these two multi-dimensional arrays would be identical. Since the size of the longer array is the size of the shorter one multiplied by the extra dimension, dividing one length by the other recovers the extra integer.

This solution would work only for the particular case I quoted from your question. You might want to port it to Java.
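Stripped of the array allocation, the same arithmetic can be done directly. Here is a hypothetical Java port (my code, not Ken Kin's, keeping his assumptions: the values are non-zero and b is the longer series). BigInteger sidesteps the overflow that products hit much sooner than sums:

import java.math.BigInteger;

public static int getUniqueByProduct(int[] a, int[] b) {
    BigInteger productA = BigInteger.ONE;
    BigInteger productB = BigInteger.ONE;
    for (int x : a) productA = productA.multiply(BigInteger.valueOf(x));
    for (int x : b) productB = productB.multiply(BigInteger.valueOf(x));
    // The common factors cancel, leaving exactly the extra element.
    return productB.divide(productA).intValue();
}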

This is just a trick, as I think the question itself is a trick. We definitely would not consider it a solution for production.

Ken Kin
  • Wouldn't the ratio be 1 due to integer division? Shouldn't you get 7? I'm not sure I see why this works. – templatetypedef Oct 06 '13 at 08:41
  • It *is* 7. I revised for some elaboration. – Ken Kin Oct 06 '13 at 10:06
  • Oh, I see. But isn't this painfully inefficient due to the huge memory requirements? – templatetypedef Oct 06 '13 at 18:16
  • @templatetypedef: As it is just a *trick* .. the question of a trick brings the answers of a trick. – Ken Kin Oct 06 '13 at 18:36
  • @templatetypedef: By the way, the limitation of memory allocation is not mentioned within the question. For an algorithm question, I guess this is what the OP's teacher wants him to discover. – Ken Kin Oct 06 '13 at 18:49
  • Wouldn't allocating a massively multidimensional array take a lot of time because of all the small allocations required? – templatetypedef Oct 06 '13 at 18:54
  • @templatetypedef: It depends, by the underlying implementation. Microsoft has a lot of famous interview questions which are not really aimed to solve realistic problems .. – Ken Kin Oct 06 '13 at 18:59
  • Haha this *is* a nice trick! Would never want to try it out on a longer list though :P And yes, space complexity was never explicitly stated, but I'm assuming (since this is an algorithms class) we're not supposed to just ignore it either. – William Gaul Oct 06 '13 at 19:30
  • The compiler still has to do O(n) work to multiply all the dimensions together. Claiming otherwise is like saying `strchr` is one "operation". Still, if C# has `typedef`, use that instead of `new` to avoid actually allocating the memory. If something insists on zeroing it after getting it from the OS, you will be in trouble. If not, overcommit means it's no problem to allocate thousands of GB, as long as you don't touch it! Either way, multiplying/dividing instead of adding/subtracting will overflow fixed-size integers much sooner. – Peter Cordes Sep 11 '15 at 15:04

Caution: it is wrong to use the O(n + m) notation. There is but one size parameter, which is n (in the asymptotic sense, n and n + 1 are equal). You should just say O(n). [For m > n+1, the problem is different and more challenging.]

As pointed by others, this is optimal as you must read all values.

All you can do is reduce the asymptotic constant. There is little room for improvement, as the obvious solutions are already very efficient. The single-loop solution in the accepted answer is probably hard to beat. Unrolling it a bit should improve it (slightly) by avoiding a branch.

If your goal is sheer performance, then you should turn to non-portable solutions such as vectorization (using the AVX instructions, 8 ints at a time) and parallelization on multicores or GPGPU. In good old dirty C on a 64-bit processor, you could map the data to an array of 64-bit ints and xor the elements two pairs at a time ;)
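As a sketch of the unrolling idea in plain Java (my code; whether it actually wins depends on the JIT), two independent accumulators shorten the loop-carried dependency chain. It assumes, as before, that b.length == a.length + 1:

public static int getUniqueElementUnrolled(int[] a, int[] b) {
    int acc0 = 0, acc1 = 0;
    int i = 0;
    // Process two pairs per iteration on independent accumulators.
    for (; i + 1 < a.length; i += 2) {
        acc0 ^= a[i] ^ b[i];
        acc1 ^= a[i + 1] ^ b[i + 1];
    }
    // Handle the leftover pair when a.length is odd.
    if (i < a.length) {
        acc0 ^= a[i] ^ b[i];
        i++;
    }
    return acc0 ^ acc1 ^ b[i]; // fold in b's extra last element
}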


I think this is similar to the Matching nuts and bolts problem.

You could possibly achieve this in O(n log n). Not sure if that's smaller than O(n+m) in this case.
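For illustration, a sort-based O(n log n) version might look like this in Java (my sketch, assuming b has the extra element): sort copies of both arrays, then walk them in lockstep until they diverge.

import java.util.Arrays;

public static int getUniqueBySorting(int[] a, int[] b) {
    int[] sa = a.clone();
    int[] sb = b.clone();
    Arrays.sort(sa);
    Arrays.sort(sb);
    // After sorting, the arrays agree up to the point where the
    // extra element was inserted.
    for (int i = 0; i < sa.length; i++) {
        if (sa[i] != sb[i]) {
            return sb[i];
        }
    }
    return sb[sb.length - 1]; // the extra element is the largest value
}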

Neeraj

There simply is no faster algorithm. The ones presented in the question are in O(n). Any arithmetic "trick" to solve this will still require each element of both arrays to be read at least once, so we stay in O(n) (or worse).

Any search strategy with complexity in a proper subset of O(n) (like O(log n)) would require sorted arrays or some other prebuilt sorted structure (binary tree, hash table). All general-purpose comparison sorts (e.g. Quicksort, Heapsort) are at least O(n log n) on average, which is worse than O(n).

Therefore, from a mathematical point of view, there is no faster algorithm. There might be some code optimizations, but they won't matter on large scale, as runtime will grow linear with the length of the array(s).

Hans Hohenfeld