How to intersect two sorted integer arrays without duplicates?

Question

This is an interview question that I am using as a programming exercise.

Input: Two sorted integer arrays A and B in increasing order and of different sizes N and M, respectively

Output: A sorted integer array C in increasing order that contains elements that appear in both A and B

Contraints: No duplicates are allowed in C

Example: For input A = {3,6,8,9} and B = {4,5,6,9,10,11}, the output should be C = {6,9}

Thank you for your answers, all! To summarize, there are two main approaches to this problem:

My original solution was to keep two pointers, one for each array, and scanning the arrays from left to right interchangeably, while picking out elements that match. So when we the current element of one array is larger than the second array, we keep incrementing the pointer of the second array until we either find the current first array element or overpass it (find one larger). I keep all matched in a separate array, which is returned once we reach the end of either one of the input arrays.

Another way that we could do this is to scan one of the arrays linearly, while using binary search to find a match in the second array. This would mean O(N*log(M)) time, if we scan A and for each of its N elements binary search on B (O(log(M)) time).

I've implemented both approaches and ran an experiment to see how the two compare (details on this can be found here). The Binary Search method seems to win when M is roughly 70 times larger than N, when N has 1 million elements.

Just because one array is bigger, does not mean combining both arrays will result in the same size. — Nahydrin, Feb 10 '12 at 17:58
@BrianGraham OP is creating new array with appropriate size at end of the program so this should not be a problem. — Ashwinee K Jha, Feb 10 '12 at 18:01
@AKJ If I have `[0, 1, 2, 3, 4]` and `[5, 6, 4, 9, 8]`, the resulting intersection is larger than how he's determing the size; resulting in missing values. — Nahydrin, Feb 10 '12 at 18:04
@BrianGraham the size is determined by index ci, which increments only when equal values are detected. — Artur Galiullin, Feb 10 '12 at 18:11
You both don't see the problem, I'm amazed at that. I'll move this to an answer... — Nahydrin, Feb 10 '12 at 18:15
@BrianGraham I've added your test case in my edit (only the second array input is sorted, as per the spec) — Artur Galiullin, Feb 10 '12 at 18:17
The right answer is to just slam the vales into a TreeMap then produce the output array using a getter. The solution above is what a C programmer would do. What happens when the next interview question is, "Same question, but what if the input lists are not sorted?" The TreeMap solution requires no change. — Tony Ennis, Feb 10 '12 at 18:48
@TonyEnnis The question asks to intersect the arrays, not merge them, I'm sorry if that wasn't clear (I will include an example); I'm not sure how a TreeMap can help in that regard. Also, just inserting all elements into a TreeMap (a Red-Black tree) will increase the complexity to O((M+N)*log(M+N)), where M and N are sizes of A and B respectively. Please correct me if I'm wrong. — Artur Galiullin, Feb 10 '12 at 20:16
See also: [How to intersect two sorted arrays the fastest possible way?](https://stackoverflow.com/q/42538902) — greybeard, Jan 11 '23 at 14:25

NPE · Answer 1 · 2012-02-10T19:45:56.227

6

How about:

public static int[] intersectSortedArrays(int[] a, int[] b){
    int[] c = new int[Math.min(a.length, b.length)]; 
    int ai = 0, bi = 0, ci = 0;
    while (ai < a.length && bi < b.length) {
        if (a[ai] < b[bi]) {
            ai++;
        } else if (a[ai] > b[bi]) {
            bi++;
        } else {
            if (ci == 0 || a[ai] != c[ci - 1]) {
                c[ci++] = a[ai];
            }
            ai++; bi++;
        }
    }
    return Arrays.copyOfRange(c, 0, ci); 
}

Conceptually it's similar to yours, but contains a number of simplifications.

I don't think you can improve on the time complexity.

edit: I've tried this code, and it passes all of your unit tests.

edited Feb 10 '12 at 19:45

answered Feb 10 '12 at 17:58

NPE

486,780
108
951
1,012

@aix I dont see a loop here. Also what if indices goes out of array length. – Ashwinee K Jha Feb 10 '12 at 18:04
@AKJ: There was no loop in the very first version I posted (due to a bizarre copy-and-paste error on my part). I've fixed that a while ago. If you refresh, you should see the correct version. – NPE Feb 10 '12 at 18:06
@AKJ: I am not sure what you mean by "indices goes out of array length". `ai` and `bi` are constrained by the `while` condition, and `c` is large enough by construction. – NPE Feb 10 '12 at 18:15
@aix, I was looking at the code without while loop hence the confusion. – Ashwinee K Jha Feb 10 '12 at 18:34
I like this. Signficiant compression without loss of readibility. Thanks! – Artur Galiullin Feb 10 '12 at 20:28
can't get `int[] c = new int[Math.min(a.length, b.length)];`. shouldn't the merged array size be equal to `a.length + b.length`? Shouldn't the array indices go out of bound in this case? – Muhammad Adeel Zahid Feb 07 '17 at 20:42

Tim Gee · Accepted Answer · 2012-02-11T02:23:58.587

This problem essentially reduces to a join operation and then a filter operation (to remove duplicates and only keep inner matches).

As the inputs are both already sorted, the join can be efficiently achieved through a merge join, with O(size(a) + size(b)).

The filter operation will be O(n) because the output of the join is sorted and to remove duplicates all you have to do is check if the each element is the same as the one before it. Filtering only the inner matches is trivial, you just discard any elements that were not matched (the outer joins).

There are opportunities for parallelism (both in the join and filter) to achieve better performance. For example the Apache Pig framework on Hadoop offers a parallel implementation of a merge join.

There are obvious trade-offs between performance and complexity (and thus maintainability). So I would say a good answer to a interview question really needs to take account of the performance demands.

Set based comparison - O(nlogn) - Relatively slow, very simple, use if there are no performance concerns. Simplicity wins.
Merge join + Filter - O(n) - Fast, prone to coding error, use if performance is an issue. Ideally try to leverage an existing library to do this, or perhaps even use a database if appropriate.
Parallel Implementation - O(n/p) - Very fast, requires other infrastructure in place, use if the volume is very large and anticipated to grow and this is a major performance bottleneck.

(Also note that the function in the question intersectSortedArrays is essentially a modified merge join, where the filter is done during the join. You can filter afterwards at no performance loss, although a slightly increased memory footprint).

Final thought.

In fact, I suspect most modern commercial RDBMSs offer thread parallelism in their implementation of joins, so what the Hadoop version offers is machine-level parallelism (distribution). From a design point of view, perhaps a good, simple solution to the question is to put the data on a database, index on A and B (effectively sorting the data) and use an SQL inner join.

Very nice connection -- I can see now how this problem is relevant in the context of DBMS (and probably most prevalent). — Artur Galiullin, Feb 13 '12 at 00:12

Periastron · Answer 3 · 2012-02-12T02:33:59.683

3

Using arraylist to store result.

public ArrayList<Integer> arrayIntersection(int [] a, int[] b)
{
    int len_a=a.length;
    int len_b=b.length;
    int i=0;
    int j=0;
    ArrayList<Integer> alist=new ArrayList();

    while(i<len_a && j<len_b)
    {
        if(a[i]<b[j])
            i++;
        else if(a[i]>b[j])
            j++;
        else if(a[i]==b[j])
        {
            alist.add(a[i]);
            i++;
            j++;

        }
    }

   return alist;    
  }

edited Feb 12 '12 at 02:33

answered Feb 12 '12 at 02:19

Periastron

663
2
10
15

Just to clarify to new readers, the result in this solution may contain duplicated values. – alemures Mar 09 '17 at 12:31

Pavel Kosarev · Answer 4 · 2023-01-10T21:41:13.417

public static int[] getIntersectionOfSortedArrays(int[] numbers1, int[] numbers2) {
    var size1 = numbers1.length;
    var size2 = numbers2.length;

    var elementsCount = Math.min(size1, size2);
    var result = new int[elementsCount];

    var i1 = 0;
    var i2 = 0;
    var index = 0;

    while (i1 < size1 && i2 < size2) {
        if (numbers1[i1] == numbers2[i2]
                && (index == 0 ||  numbers1[i1] != result[index - 1])) {
            result[index] = numbers1[i1];
            i1++;
            i2++;
            index++;
        } else if (numbers1[i1] > numbers2[i2]) {
            i2++;
        } else {
            i1++;
        }
    }

    return Arrays.copyOf(result, index);
}

What would make the code presented noteworthy? – greybeard Jan 11 '23 at 09:35 — greybeard, Jan 11 '23 at 09:35

score 0 · Answer 5 · answered Feb 11 '12 at 01:21

I don't know if it is a good idea to solve the problem in this way:

say

  A,B are 1 based arrays
    A.length=m
    B.length=n

1) init an array, C, with min(m,n) length

2) only focus on the common part by checking the first and last element. here binary search could be used. take an example to save some words:

 A[11,13,15,18,20,28,29,80,90,100.........300,400]
    ^                                          ^
 B[3,4,5,6,7.8.9.10.12,14,16,18,20,..400.....9999]
                     ^                ^


then we need only focus  on

    A[start=1](11)-A[end=m](400)
    and
    B[start=9](12)-B[end](400)

3). compare the range (end-start) of both Arrays. taking the array with smaller range , say A, for each element A[i] from A[start] ~ A[end], do binary search in B[start,end],

if found, put element in C, reset B.start to foundIdx+1,
otherwise B.start is set to the smallest element [j], which B[j] is greater than A[i], to narrow the range

4) continue 3) untill all elements in A[start, end] was processed.

by step 1, we could find the case if there is no intersection between two Array.
when doing binary search in step 3, we compare A[i] with A[i-1], if same, skip A[i]. in order to keep elements in C are unique.

in this way, worse case would be lg(n!) if(A and B are same) ? not sure.

Avg case?

DRobinson · Answer 6 · 2012-02-13T12:42:56.970

0

Here's a memory improvement:

It would be better to store your results (C) in a dynamic structure, like a linked list, and create an array after you're done finding the intersecting elements (exactly as you do with array r). This technique would be especially good if you have very large arrays for A and B and expect the common elements to be few in comparison (why search for a huge chunk of contiguous memory when you only need a small amount?).

EDIT: one more thing that I would change, and this might be just a little nit-picky, is that I would avoid using unbound loops when the worst case number of iterations is known before hand.

edited Feb 13 '12 at 12:42

answered Feb 11 '12 at 16:21

DRobinson

4,441
22
31

Isn't Big Theta a tighter bound than Big Oh? I think in my solution, the worst case is asymptotically equivalent to the best case, hence I used Big Theta. I found an interesting SO discussion [here](http://stackoverflow.com/questions/471199/what-is-the-difference-between-n-and-on). – Artur Galiullin Feb 13 '12 at 00:19
Eek, I'm sorry I was a little tired and had read Theta as Omega (not word for word, but in meaning). You're absolutely correct, I've edited my post. That said, the main point of the post was to explain that using a dynamic data structure would be a very good idea because you don't need full search and you're parsing it to a new array at the end anyway. – DRobinson Feb 13 '12 at 12:42

bchetty · Answer 7 · 2012-02-10T23:55:24.547

-1

If you are using 'Integer' (object) arrays and would like to use the java API methods, you can check the below code. Note that the below code probably has more complexity (as it uses some conversion logic from one datastructure to other) and memory consumption (because of using objects) than the primitive method, as listed above. I just tried it (shrugs):

public class MergeCollections {
    public static void main(String[] args) {
        Integer[] intArray1 = new Integer[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        Integer[] intArray2 = new Integer[] {2, 3, 5, 7, 8, 11, 13};

        Set<Integer> intSet1 = new TreeSet<Integer>();
        intSet1.addAll(Arrays.asList(intArray1));
        intSet1.addAll(Arrays.asList(intArray2));
        System.out.println(intSet1);
    }
}

And the output:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13]

Also, check this link: Algolist - Algo to merge sorted arrays

EDIT: Changed HashSet to TreeSet

EDIT 2: Now that the question is edited and clear, I'm adding a simple solution to find intersection :

public class Intersection {
    public static void main(String[] args) {
        Integer[] intArray1 = new Integer[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        Integer[] intArray2 = new Integer[] {2, 3, 5, 7, 8, 11, 13};

        List<Integer> list1 = Arrays.asList(intArray1);
        Set<Integer> commonSet = new TreeSet<Integer>();
        for(Integer i: intArray2) {
            if(list1.contains(i)) {
                commonSet.add(i);
            }
        }

        System.out.println(commonSet);
    }
}

edited Feb 10 '12 at 23:55

answered Feb 10 '12 at 18:49

bchetty

2,231
1
19
26

This is more idiomatic though a TreeSet (etc) may be nice to use. – Tony Ennis Feb 10 '12 at 18:52
Also, a bit idiotic (specially if someone is trying to learn algorithms). :) – bchetty Feb 10 '12 at 18:55
Tony, tried to post a fast solution and forgot that. I edited the code to use TreeSet. Thanks for the suggestion. :) – bchetty Feb 10 '12 at 19:02
Your solution is a *far* better interview answer. I would probably reject an interviewee who answered with the "C-like" solution. – Tony Ennis Feb 10 '12 at 19:42
Actually, the question asks to intersect the arrays, not merge them. – Artur Galiullin Feb 10 '12 at 20:03
1

-1 for reading comprehension for me. Was that changed after the question was originally posted? – Tony Ennis Feb 10 '12 at 20:11
@TonyEnnis No, but I just added an example for clarity. – Artur Galiullin Feb 10 '12 at 20:19
I think if the question tagged as algorithm, the high level api like list.contains(e) might be not allowed to be used... otherwise, I would suggest guava or apache commons, with an oneliner to get the thing. ;) – Kent Feb 11 '12 at 01:35

How to intersect two sorted integer arrays without duplicates?

7 Answers7

Linked