First common number in both arrays of size 1 million

Question

I have two very large arrays of integer, each one is of size approximately 1 million. I have to find first integer which is present in both arrays.

I tried to do this using a set.

(1) Traverse each array simulatneously and insert elements of both array in set.

(2) Whenever set refuses to accept that is the first intersection.

int Solution(int A[], int B[])
{
    Set s = new HashSet();
    for (int i = 0 ; ; i++)
    {
        if ( i < A.length )
        {
            if( !s.Add(A[i]) )
                System.out.println(A[i]);
        }
        if ( i < B.length )
        {
            if( !s.Add(B[i]) )
                System.out.println(B[i]);
        }
    }
}

Can we improve this solution to reduce time complexity ?

Thanks

What's the first common number in arrays `{1, 2, 3}` and `{2, 1, 3}`? — Sergey Kalinichenko, Jun 16 '15 at 16:43
Thats only O(n) if I remember my algorithms correctly or O(1,000,000) in this case. Maybe you can do a non linear time sort then run a binary search — jgr208, Jun 16 '15 at 16:43
@RyanJ stole my idea haha! but yes I agree a sort and binary search would work great for reducing time complexity. — jgr208, Jun 16 '15 at 16:43
No arrays are not sorted and in case of A{1,2,3} B {2,1,3} 1 is the number because it's occur first in A. — anonymous, Jun 16 '15 at 16:46
@anonymous the point of dasblinkenlight's question is to get you to think about applying your algorithm to those arrays. See which number is actually returned using your logic... — Ryan J, Jun 16 '15 at 16:47
I guess there are no duplicates allowed in those arrays. Otherwise your algorithm also treats duplicates in the same array as solution. Also: You want to improve a algorithm that has `O(n)` expected time, but the result can't get better than `O(n)` — fabian, Jun 16 '15 at 16:55
Is there a min and max on your integers, or do they use the full range of `int` in Java? — Sergey Kalinichenko, Jun 16 '15 at 17:19
@anonymous you said that for `A{1,2,3} B {2,1,3}` **1** is the number because it's occur first in A but your algorithm result will be **2**. Can you explained what is the purpose for this search? — MaxZoom, Jun 16 '15 at 18:45
By 'first' do you mean first in one of the arrays, or first in numerical order? I'm willing to bet you mean first in numerical order. — DJClayworth, Jun 16 '15 at 19:45
@YvesDaoust: `The "first" element in both arrays is meaningless` - lacks/needs a definition. Lowest sum of ordinals/indices? — greybeard, Aug 15 '17 at 22:11

score 5 · Accepted Answer · answered Jun 16 '15 at 16:55

5

in case of A={1,2,3} B={2,1,3} 1 is the number because it's occur first in A

This means that your algorithm is not going to produce the right answer in some cases. Consider this data:

A = {1, 2, 3, 4, 5, 6, 7}
B = {7, 2, 3, 4, 5, 6, 1}

Your algorithm would return 2 instead of 1, because 2 would be detected after the second insertion into both sets, while you would need to iterate B to the end in order to detect 1.

One approach that is going to give you the right solution according to your spec is to load all elements of B into a hash set, and then iterate A until you get a hit in the set composed from numbers in B. This approach is O(N_a+N_b).

Set<Integer> bSet = new HashSet<Integer>();
for (int n : B) {
    bSet.add(n);
}
for (int n : A) {
    if (bSet.contains(n)) {
        return n;
    }
}
// If you get here, arrays have no elements in common

answered Jun 16 '15 at 16:55

Sergey Kalinichenko

714,442
84
1,110
1,523

It can be improved to use `O(min{N_a,N_b})` space, by loading only the smaller array to the hash table. It might be significant in cases where one array might be significantly smaller than the other one. – amit Jun 16 '15 at 17:05
@amit Switching arrays around may alter the result. It looks like OP has a requirement that in case of ties the number that occurs in `A` earlier must win. – Sergey Kalinichenko Jun 16 '15 at 17:12
By using a map value->index, and remembering the minimal index while iterating, it can still be done, with the same results. – amit Jun 16 '15 at 17:16
If *first* means *occuring first in A*, then populate the has table from `B` and iterate through `A` looking up values in the hash table. – chqrlie Jun 17 '15 at 18:02

score 2 · Answer 2 · 2015-06-17T10:41:58.257

2

Contrary to comments, sorting and binary search are not the most efficient.

Assuming that both arrays are of size N, a hash table will be filled and then used for detecting duplicates in time quasi O(N).

By constrast, sorting will take time O(N Lg(N)), and subsequent binary searches O(N Lg(N)) as well in the worst case.

Anyway, if your data is already sorted or can be sorted cheaply for some reason (bucket sort ?), don't use binary search, leading to O(N Log(N)), but merging, done in O(N).

Also, if the range of the integers is limited, say no more than 25 significant bits (like 0 to 33554431), it can be advantageous to use a bit array. It will require a space of 4MB (just like your million integers), and time O(N) for initialization and detection of duplicates, with very simple and fast code.

edited Jun 17 '15 at 10:41

answered Jun 16 '15 at 16:54

Sorting an array of arbitrary integers can be done in `O(N)` time with radix sort. Sort both arrays and merge them. You will find **all** common numbers in `O(N)` time. The first number found with this algorithm is the *smallest* common number, a possible interpretation of *The first number common to both arrays* – chqrlie Jun 16 '15 at 17:58
@chqrlie: yep, we are in line. I bet that the "first" requirement of the OP is just a bogus phrasing. – Jun 16 '15 at 18:35

score 1 · Answer 3 · answered Jun 16 '15 at 16:49

1

You can use a Merge sort which has worst time of n log n and then use a binary search which is worst case log n to get a total worst time of (sorry haven't done this math for a while so might be off) O(n log (log n^2))

answered Jun 16 '15 at 16:49

jgr208

2,896
9
36
64

1

Is that better than O(n)? – hatchet - done with SOverflow Jun 16 '15 at 16:53
I think the math boils down to `n log(n) + log (n) = log (n^(n+1))` – Ryan J Jun 16 '15 at 16:54
@RyanJ thanks! Forgot the exact way to do this, haven't had to compute a complexity for a while. – jgr208 Jun 16 '15 at 16:56
@hatchet well I remember a log function is normally slower growing then a n function. – jgr208 Jun 16 '15 at 16:56
@jgr208: You're right; `n` grows faster than `log n`. However `n * log n` grows faster than `n`. Also if you do merge sort on one array and then do binary search for every element of the other array the complexity is `O(n * log(n) + n * log(n)) = O(n*log (n))` and not `O(n log (log n^2)) = O(n * log (log (n)))`. Furthermore you can prove that you can't do better than `O(n)`: Assume the arrays have equal sizes and the last elements match. Every element has to be accessed at least once to make sure there is no duplicate before the last element, which is in `Ω(n)` – fabian Jun 16 '15 at 17:11
@fabian that is true and yea after i did that I was think about the Omega (high limit) so in that case a for loop is just as good as anything else. By the way, how did you insert the Omega symbol? – jgr208 Jun 16 '15 at 17:27

score 0 · Answer 4 · edited May 23 '17 at 11:58

Simple linear solution:

It can be done in O(n+m) time (on average) and O(n) space (n being the size of the first array, m being the size of the second array).

set = new empty hash set
for each x in arr2:
    set.add(x)
for each x in arr1 in ascending order:
    if set.contains(x):
        return x
//if here, no duplicates
return null

Slight improvement in memory consumption:

It can be improved to O(min{n,m}) space, by first checking which array is smaller, if the second- do the same algorithm as suggested, otherwise, load into the map (x,i) - the (element,index) pair, iterate the second list, and find the smallest i where there is a match, and return it:

Pseudo code for the approach with better memory complexity:

def secondSmaller(arr1,arr2):
    set = new empty hash set
    for each x in arr2:
        set.add(x)
    for each x in arr1 in ascending order:
        if set.contains(x):
            return x
    //if here, no duplicates
    return null
def firstSmaller(arr1,arr2):
    map = new empty hash map
    for each x in arr1 with index i:
        map.add(x,i)
    minimal = infinity
    minVal = null
    for each x in arr2:
         if set.contains(x):
         i = map.get(x)
         if i < minimal:
            minimal = i
            minVal = x
     return minVal
if arr1.size() > arr2.size():
     return secondSmaller(arr1,arr2)
else return firstSmaller(arr1,arr2)

Related thread: How do I calculate the delta (inserted/deleted/moved indexes) of two lists?

As a side note, this is closely correlates to the Element Distinctness Problem, and I doubt it can be done more efficiently than this, due to the lower bounds of element distinctness problem.

chqrlie · Answer 5 · 2015-06-17T14:50:14.667

0

Sorting an array of java 32-bit ints or more generally any fixed size integers can be done in O(N) time with radix sort. Sort both arrays and merge them. You will find all common numbers in O(N) time.

The first number found with this algorithm is the smallest common number, a possible interpretation of The first number common to both arrays

edited Jun 17 '15 at 14:50

answered Jun 16 '15 at 18:01

chqrlie

131,814
10
121
189

You have a major flaw with the logic of "Sorting integers is O(N)". You assume integers are bounded to be for some size, probably 2^64 or 2^32. Problem is, not many systems also supports arrays of size over 2^32 or 2^64, so in fact, the bound on the size of the integer, is usually not more than the bound of the size of the array, and assuming this is the case, there is no significant improvement with this assumption and technique. – amit Jun 17 '15 at 11:50
It will be more accurate to say radix sort is `O(nlogU)` (where `U` is the maximal absolute value of all elements). If the numbers are drawn i.i.d. over all ranges of `int`s, it is foolish to assume `U` is constant, and regarding to `logU` as constant, where the maximal supported value of `n` holds `n – amit Jun 17 '15 at 11:52
@amit: I agree with you in the general case, but given the context, the question being tagged `java`, the array of *integers* is just an array of java `int` or `long`. The OPs sample code explicitly uses `int A[]`, so in this particular case, Radix sort can be used efficiently and performs in linear time, thus has complexity `O(N)`. – chqrlie Jun 17 '15 at 14:48
This is specifically true to java, where array's maximum size is less than [Integer.MAX_VALUE](http://stackoverflow.com/a/3039805/572670), so saying radix sort is `O(n)` because `log(U) <= 32`, is same as saying other sorts are `O(n)` because `log(n) <= 31`, and just one step from claiming it's `O(1)` because `n < 2^31`. This is technically true, but not very helpful. – amit Jun 17 '15 at 16:39
@amit: I am afraid you are mistaken. Different methods for sorting arrays of specific objects do have different complexities. Sorting an array of random `bool` with `bubblesort` is `O(N**2)` no matter how you slice the values. Sorting the same array by counting the true and false values is obviously `O(N)`. Radix sorting an array of fixed size `int` is linear in the number of items in the set, it does have a fixed overhead that depends on the size of the `int` type. It can be implemented in most languages, nothing is java specific. – chqrlie Jun 17 '15 at 18:11
I am not mistaken, I am claiming saying radix sort is linear because `log(U) < =32`, when using 32 bits, while ignoring the fact that it is also guaranteed `log(n) < 32`, is foolish - so if you say O(nlog(U)) is linear, you must also regard O(nlogn) as such. – amit Jun 17 '15 at 18:14
@amit: in your informed opinion, how would you describe the complexity of radix sort for arbitrary arrays of 32 bit integers? If U is constant, O(N.log(U)) is the same complexity as O(N). Whereas O(N.log(N)) is not. – chqrlie Jun 17 '15 at 18:17
Like I said, you are technically correct, but the claim that sorting a java array is O(1) is also correct technically correct, since `n < 2^31`. Practically, IMO, you should not regard it as O(1), nor O(n), it should be O(nlogn) or O(nlogU), depending on the algorithm at hand. – amit Jun 17 '15 at 18:18
Or in other words, since `log(n) < log(U)` for uniform distribution of elements over all `int`s, `O(nlogU)` is "worse" than `O(nlogn)`, and if you regard the first as linear, you should also regard the 2nd as such. – amit Jun 17 '15 at 18:33
@amit: that's theory, in practice a good implementation of Radix Sort beats other methods easily for even medium size arrays of uniformly distributed `int`s. – chqrlie Jun 17 '15 at 18:43

First common number in both arrays of size 1 million

5 Answers5