19

Given two sorted arrays of numbers, we want to find the pair with the kth largest possible sum (a pair being one element from the first array and one element from the second array). For example, with the arrays

  • [2, 3, 5, 8, 13]
  • [4, 8, 12, 16]

The pairs with largest sums are

  • 13 + 16 = 29
  • 13 + 12 = 25
  • 8 + 16 = 24
  • 13 + 8 = 21
  • 8 + 12 = 20

So the pair with the 4th largest sum is (13, 8). How do we find the pair with the kth largest possible sum?

Also, what is the fastest algorithm? The arrays are already sorted, with sizes M and N.


I am already aware of the O(K log K) solution, using a max-heap, given here.

It is also a favorite Google interview question, and they demand an O(k) solution.

I've also read somewhere that there exists an O(k) solution, which I am unable to figure out.

Can someone explain the correct solution with pseudocode?

P.S. Please DON'T post this link as an answer/comment. It DOESN'T contain the answer.

Spandan
  • You can find a linear-time algorithm in this pdf: ["Selection in X + Y and matrices with sorted rows and columns"](http://www.cse.yorku.ca/~andy/pubs/X%2BY.pdf). – Evgeny Kluev Sep 01 '13 at 10:02
  • Your description of the problem is rather vague. Is it the sum of two elements with the same index? Do the two arrays have equal amount of elements? – Nikolai Ruhe Sep 01 '13 at 10:06
  • @EvgenyKluev the algorithm described in the pdf is **O(n)**, not exactly **O(k)**, and it only fits the case where M and N are equal – hs3180 Sep 01 '13 at 13:18
  • @hs3180: Yes, this algorithm is O(n), which is better than the requested O(k). If k – Evgeny Kluev Sep 01 '13 at 13:42
  • possible duplicate of [Selection algorithms on sorted matrix](http://stackoverflow.com/questions/5000836/selection-algorithms-on-sorted-matrix) – David Eisenstat Sep 01 '13 at 14:25
  • Here is the solution which find the top N sums in O(N) time: [link](http://www.ocf.berkeley.edu/~wwu/cgi-bin/yabb/YaBB.cgi?board=riddles_cs;action=display;num=1132204952;start=25#47). And I think it can be modified to solve you problem. – fuch Sep 02 '13 at 08:46
  • "I've also read somewhere that there exists a O(k) solution using two pointers " I seriously doubt that multiple sets of authors missed a solution so simple that this description would do it justice -- there's an easy way to use a deterministic algorithm for this problem to obtain a deterministic selection algorithm on an unsorted 1D array. – David Eisenstat Sep 03 '13 at 14:20
  • Just to clarify, you are looking for an O(k) solution, where k – Maggie Sep 03 '13 at 18:18
  • @EvgenyKluev: sounds like you have a perfect solution, covering all the edge cases. just post an answer;) – Karoly Horvath Sep 03 '13 at 18:57
  • @Spandan please clarify: are you looking for ONLY the sum, or you need to display all possible pair combinations? – Maggie Sep 04 '13 at 16:58
  • @Maggie: Only the sum would be fine :) – Spandan Sep 04 '13 at 22:06
  • Wouldn't `O(k)` prevent you from reading all the input? If the input is given as a parameter, is it already sorted? `O(k)` also doesn't make sense because finding the 5th element vs finding the 95th element in an input of 100 elements should have the same complexity, and I'd think it would be the same for the 50th element too. – necromancer Sep 06 '13 at 06:45
  • @necromancer: O(k) does indeed prevent reading all the input when k < n, which can certainly happen. But that only creates problems if we *need* to read them all -- and just as with O(log n) binary search on an already-sorted length-n array, that happens to not be the case here, since enough elements can be proven not to need reading. – j_random_hacker Aug 21 '15 at 16:03

7 Answers

14

I start with a simple but not quite linear-time algorithm. We choose some value between array1[0]+array2[0] and array1[N-1]+array2[N-1]. Then we determine how many pair sums are greater than this value and how many are less. This may be done by iterating the arrays with two pointers: the pointer into the first array is incremented when the sum is too large and the pointer into the second array is decremented when the sum is too small. Repeating this procedure for different values and using binary search (or one-sided binary search), we could find the Kth largest sum in O(N log R) time, where N is the size of the largest array and R is the number of possible values between array1[N-1]+array2[N-1] and array1[0]+array2[0]. This algorithm has linear time complexity only when the array elements are integers bounded by a small constant.
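As a rough illustration, the counting step and the binary search over values can be sketched in Python (function names are mine; this covers only the bounded-integer case described above, not the full algorithm):

```python
def count_pairs_greater(a, b, x):
    # a, b sorted ascending; counts pairs with a[i] + b[j] > x in O(M + N)
    count = 0
    j = 0
    for i in range(len(a) - 1, -1, -1):   # scan a from largest to smallest
        while j < len(b) and a[i] + b[j] <= x:
            j += 1                        # skip b values that are too small
        count += len(b) - j               # b[j:] all exceed x when paired with a[i]
    return count

def kth_largest_sum(a, b, k):
    # binary search over integer sum values; assumes integer elements
    lo, hi = a[0] + b[0], a[-1] + b[-1]   # smallest and largest possible sums
    while lo < hi:
        mid = (lo + hi) // 2
        if count_pairs_greater(a, b, mid) >= k:
            lo = mid + 1                  # at least k sums exceed mid: answer is larger
        else:
            hi = mid
    return lo
```

On the question's example, `kth_largest_sum([2, 3, 5, 8, 13], [4, 8, 12, 16], 4)` yields 21.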

The previous algorithm may be improved if we stop the binary search as soon as the number of pair sums in the binary search range decreases from O(N²) to O(N). Then we fill an auxiliary array with these pair sums (this may be done with a slightly modified two-pointer algorithm). And then we use the quickselect algorithm to find the Kth largest sum in this auxiliary array. All this does not improve worst-case complexity because we still need O(log R) binary search steps. What if we keep the quickselect part of this algorithm but (to get a proper value range) use something better than binary search?
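The range-narrowing step can be illustrated like this (my sketch, not the paper's pseudo-code): once the range [lo, hi] is known to contain only O(N) pair sums, collect them with two monotone pointers and select among them.

```python
def sums_in_range(a, b, lo, hi):
    # a, b sorted ascending; collect all pair sums s with lo <= s <= hi.
    # As ai increases, both window pointers only move left, so the total
    # work is O(M + N + output size).
    out = []
    jlo = jhi = len(b)                    # window [jlo, jhi) of valid b indices
    for ai in a:
        while jlo > 0 and ai + b[jlo - 1] >= lo:
            jlo -= 1                      # extend window to smaller b values
        while jhi > 0 and ai + b[jhi - 1] > hi:
            jhi -= 1                      # shrink window past too-large sums
        out.extend(ai + bj for bj in b[jlo:jhi])
    return out

def kth_largest_in(sums, rank):
    # sorted() stands in for quickselect / median-of-medians in this sketch
    return sorted(sums, reverse=True)[rank - 1]
```

For the question's arrays, `sums_in_range([2, 3, 5, 8, 13], [4, 8, 12, 16], 20, 29)` collects exactly the six sums 20, 21, 21, 24, 25, 29.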

We could estimate the value range with the following trick: take every second element from each array and try to find the pair sum with rank k/4 for these half-arrays (using the same algorithm recursively). Obviously this should give some approximation of the needed value range. And in fact a slightly improved variant of this trick gives a range containing only O(N) elements. This is proven in the following paper: "Selection in X + Y and matrices with sorted rows and columns" by A. Mirzaian and E. Arjomandi. This paper contains a detailed explanation of the algorithm, a proof, complexity analysis, and pseudo-code for all parts of the algorithm except quickselect. If linear worst-case complexity is required, quickselect may be augmented with the median of medians algorithm.

This algorithm has complexity O(N). If one of the arrays is shorter than the other (M < N), we can assume that this shorter array is extended to size N with some very small elements, so that all calculations in the algorithm use the size of the largest array. We don't actually need to extract pairs with these "added" elements and feed them to quickselect, which makes the algorithm a little bit faster but does not improve asymptotic complexity.

If k < N we could ignore all the array elements with index greater than k. In this case complexity is equal to O(k). If N < k < N(N-1) we just have better complexity than requested in OP. If k > N(N-1), we'd better solve the opposite problem: k'th smallest sum.

I uploaded simple C++11 implementation to ideone. Code is not optimized and not thoroughly tested. I tried to make it as close as possible to pseudo-code in linked paper. This implementation uses std::nth_element, which allows linear complexity only on average (not worst-case).


A completely different approach to finding the K'th sum in linear time is based on a priority queue (PQ). One variation is to insert the largest pair into the PQ, then repeatedly remove the top of the PQ and instead insert up to two pairs (one with a decremented index in one array, the other with a decremented index in the other array), taking some measures to prevent inserting duplicate pairs. The other variation is to insert all possible pairs containing the largest element of the first array, then repeatedly remove the top of the PQ and instead insert the pair with a decremented index in the first array and the same index in the second array. In this case there is no need to bother about duplicates.
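The second variation can be sketched with Python's `heapq` (a min-heap, so sums are negated; names are mine, and note that a binary heap gives O((N + K) log N) rather than the O(K) this answer is after):

```python
import heapq

def kth_largest_sum_pq(a, b, k):
    # a, b sorted ascending; assumes 1 <= k <= len(a) * len(b)
    m = len(a)
    # Seed the PQ with every pair that uses a's largest element; each heap
    # entry is identified by its b-index column, so duplicates cannot arise.
    heap = [(-(a[m - 1] + bj), m - 1, j) for j, bj in enumerate(b)]
    heapq.heapify(heap)
    for _ in range(k):
        neg_sum, i, j = heapq.heappop(heap)
        if i > 0:
            # replace the popped pair by its successor in the same b-column
            heapq.heappush(heap, (-(a[i - 1] + b[j]), i - 1, j))
    return -neg_sum
```

On the question's example this returns 29, 25, 24, 21 for k = 1..4.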

OP mentions the O(K log K) solution where the PQ is implemented as a max-heap. But in some cases (when array elements are evenly distributed integers with limited range and linear complexity is needed only on average, not worst-case) we could use an O(1)-time priority queue, for example, as described in this paper: "A Complexity O(1) Priority Queue for Event Driven Molecular Dynamics Simulations" by Gerald Paul. This allows O(K) expected time complexity.

An advantage of this approach is the possibility of producing the first K elements in sorted order. The disadvantages are a limited choice of array element type, a more complex and slower algorithm, and worse asymptotic complexity: O(K) > O(N).

Evgeny Kluev
0

EDIT: This does not work. I leave the answer, since apparently I am not the only one who could have this kind of idea; see the discussion below. A counter-example is x = (2, 3, 6), y = (1, 4, 5) and k=3, where the algorithm gives 7 (3+4) instead of 8 (3+5).


Let x and y be the two arrays, sorted in decreasing order; we want to construct the K-th largest sum.

The variables are: i the index in the first array (element x[i]), j the index in the second array (element y[j]), and k the "order" of the sum (k in 1..K), in the sense that S(k)=x[i]+y[j] will be the k-th greatest sum satisfying your conditions (this is the loop invariant).

Start from (i, j) equal to (0, 0): clearly, S(1) = x[0]+y[0].

for k from 1 to K-1, do:

  • if x[i+1]+ y[j] > x[i] + y[j+1], then i := i+1 (and j does not change) ; else j:=j+1

To see that it works, suppose you have S(k) = x[i] + y[j]. Then S(k+1) is the greatest sum which is lower than (or equal to) S(k) and such that at least one index (i or j) changes. It is not difficult to see that exactly one of i or j should change. If i changes, the greatest sum you can construct which is lower than S(k) is obtained by setting i=i+1, because x is decreasing and all the x[i'] + y[j] with i' < i are greater than S(k). The same holds for j, showing that S(k+1) is either x[i+1] + y[j] or x[i] + y[j+1].

Therefore, at the end of the loop you have found the K-th greatest sum.
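For the record, a direct Python transcription of the loop above (my transcription, with bounds guards added) reproduces the failure on the counter-example from the edit note:

```python
def flawed_kth(x, y, K):
    # x, y sorted in DECREASING order, as in the answer above
    i = j = 0
    for _ in range(K - 1):
        # lookahead sums; out-of-range candidates count as -infinity
        opt_i = x[i + 1] + y[j] if i + 1 < len(x) else float('-inf')
        opt_j = x[i] + y[j + 1] if j + 1 < len(y) else float('-inf')
        if opt_i > opt_j:
            i += 1
        else:
            j += 1
    return x[i] + y[j]

# counter-example: x = (2, 3, 6), y = (1, 4, 5), k = 3
print(flawed_kth([6, 3, 2], [5, 4, 1], 3))  # 7, but the 3rd largest sum is 8
```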

Nicolas Grebille
  • This is pretty much the same as the **flawed solution** I posted a couple of days ago. Part of the problem is that it does not generate all possible sums - so by force of logic it won't always find the Kth largest sum. Consider arrays {2, 10} and {1, 2}: there are 4 possible sums here, {3, 4, 11, 12}. Your algorithm will only generate 3 of them: {3, 4, 12}. Originally I left my wrong answer posted with some discussion as to what was wrong with it to help others, like yourself, not make the same mistake. However, after attracting down votes I decided to remove it. – NealB Sep 06 '13 at 16:18
  • I don't follow. Step 1: (10, 2) -> 12. Step 2: I compare (2,2)->4 and (10, 1)->11 by decreasing each index, I choose (10, 1)->11. Why would you say the algorithm cannot generate 11? – Nicolas Grebille Sep 06 '13 at 18:00
  • The number of sums that can be generated from arrays of size N and M is N*M when all pairs are considered. Start your algorithm with i=0; j=0 then limit i to N-1 and j to M-1. Notice your algorithm increments either i or j on each iteration of sum calculation. This means that at most N+M-1 sums can be generated. Now consider that (N*M)>=(N+M-1) for all values of N and M > 0. Obviously not all sums are calculated, one of the non-generated sums could be Kth largest. – NealB Sep 06 '13 at 19:29
  • The size of the arrays does not matter as long as the arrays are sorted. The algorithm generate exactly `K` sums, which are the `K` largest sums in the arrays (because these are all different, and _none_ of the non-generated sums can be larger, by construction). Note that the OP asks for an `O(K)` algorithm, which clearly cannot try the `N*M` possible sums! – Nicolas Grebille Sep 06 '13 at 20:18
  • And so what happens when I choose K to be N*M? – NealB Sep 06 '13 at 21:05
  • @NealB Got it, thanks. I leave the answer with an explanation, in case somebody would have the same (wrong) idea. – Nicolas Grebille Sep 06 '13 at 23:44
  • This is almost right - in addition to looking ahead, you need to look behind at the previous iteration also. – Marcin Sep 07 '13 at 01:34
  • @Marcin For one of the arrays, you need to keep track of the index of where it was last used in a reported sum using the other array, but as soon as you do that the algorithm is no longer an O(k) solution, so it doesn't answer the question. – NealB Sep 09 '13 at 15:46
0

tl;dr: If you look ahead and look behind at each iteration, you can start with the end (which is highest) and work back in O(K) time.

Although the insight underlying this approach is, I believe, sound, the code below is not quite correct at present (see comments).


Let's see: first of all, the arrays are sorted. So, if the arrays are a and b with lengths M and N, and as you have arranged them the largest items are in slots M and N respectively, then the largest pair will always be a[M]+b[N].

Now, what's the second largest pair? It's going to have perhaps one of {a[M],b[N]} (it can't have both, because that's just the largest pair again), and at least one of {a[M-1],b[N-1]}. BUT, we also know that if we choose a[M-1]+b[N-1], we can make one of the operands larger by choosing the higher number from the same list, so it will have exactly one number from the last column, and one from the penultimate column.

Consider the following two arrays: a = [1, 2, 53]; b = [66, 67, 68]. Our highest pair is 53+68. If we lose the smaller of those two, our pair is 68+2; if we lose the larger, it's 53+67. So, we have to look ahead to decide what our next pair will be. The simplest lookahead strategy is simply to calculate the sum of both possible pairs. That will always cost two additions, and two comparisons for each transition (three, because we need to deal with the case where the sums are equal); let's call that cost Q.

At first, I was tempted to repeat that K-1 times. BUT there's a hitch: the next largest pair might actually be the other pair we can validly make from {a[M], b[N]} and {a[M-1], b[N-1]}. So, we also need to look behind.

So, let's code (python, should be 2/3 compatible):

def kth(a, b, k):
    M = len(a)
    N = len(b)
    if k > M * N:
        raise ValueError("There are only %s possible pairs; you asked for the %sth largest, which is impossible" % (M * N, k))
    ia, ib = M - 1, N - 1  # 0-based arrays
    # we need this for lookback
    nottakenindices = (0, 0)  # could be any value
    nottakensum = float('-inf')
    for i in range(k - 1):
        optionone = a[ia] + b[ib - 1]
        optiontwo = a[ia - 1] + b[ib]
        biggest = max((optionone, optiontwo))
        # first deal with look behind
        if nottakensum > biggest:
            if optionone == biggest:
                newnottakenindices = (ia, ib - 1)
            else:
                newnottakenindices = (ia - 1, ib)
            ia, ib = nottakenindices
            nottakensum = biggest
            nottakenindices = newnottakenindices
        # deal with case where indices hit 0
        elif ia <= 0 and ib <= 0:
            ia = ib = 0
        elif ia <= 0:
            ib -= 1
            ia = 0
            nottakensum = float('-inf')
        elif ib <= 0:
            ia -= 1
            ib = 0
            nottakensum = float('-inf')
        # lookahead cases
        elif optionone > optiontwo:
            # then choose the first option as our next pair
            nottakensum, nottakenindices = optiontwo, (ia - 1, ib)
            ib -= 1
        elif optionone < optiontwo:  # choose the second
            nottakensum, nottakenindices = optionone, (ia, ib - 1)
            ia -= 1
        # next two cases apply if options are equal
        elif a[ia] > b[ib]:  # drop the smallest
            nottakensum, nottakenindices = optiontwo, (ia - 1, ib)
            ib -= 1
        else:  # might be equal or not - we can choose arbitrarily if equal
            nottakensum, nottakenindices = optionone, (ia, ib - 1)
            ia -= 1
        # +2 - one for zero-based, one for skipping the 1st largest
        data = (i + 2, a[ia], b[ib], a[ia] + b[ib], ia, ib)
        narrative = "%sth largest pair is %s+%s=%s, with indices (%s,%s)" % data
        print(narrative)  # this will work in both versions of python
        if ia <= 0 and ib <= 0:
            raise ValueError("Both arrays exhausted before Kth (%sth) pair reached" % data[0])
    return data, narrative

For those without python, here's an ideone: http://ideone.com/tfm2MA

At worst, we have 5 comparisons in each iteration, and K-1 iterations, which means that this is an O(K) algorithm.

Now, it might be possible to exploit information about differences between values to optimise this a little bit, but this accomplishes the goal.


Here's a reference implementation (not O(K), but it will always work, barring corner cases where pairs have equal sums):

import itertools

def refkth(a, b, k):
    pairs = sorted(itertools.product(enumerate(a), enumerate(b)),
                   key=lambda p: p[0][1] + p[1][1],
                   reverse=True)
    (rightia, righta), (rightib, rightb) = pairs[k - 1]
    data = k, righta, rightb, righta + rightb, rightia, rightib
    narrative = "%sth largest pair is %s+%s=%s, with indices (%s,%s)" % data
    print(narrative)  # this will work in both versions of python
    return data, narrative

This calculates the cartesian product of the two arrays (i.e. all possible pairs), sorts them by sum (largest first), and takes the kth element. The enumerate function decorates each item with its index.

Marcin
  • I see two flaws in this code. (1) You never do any bounds checking. After, for example, `ia-=1` you could easily have `ia` equal to -1. Python conveniently restarts walking the same array from the end. But that's obviously not intended. (2) If `nottakensum > biggest` you put one of the "lookahead" positions to `nottakenindices` and completely ignore other one, also in other two options you just overwrite `nottakenindices` without using its previous value. So in all 3 cases you drop some position which (in correct implementation) should be considered later. – Evgeny Kluev Sep 07 '13 at 10:33
  • @EvgenyKluev (1) Fair point (2) These are correct observations about what my code does, but they do not disclose an error. Would you care to provide an example of a case when this is a mistake? – Marcin Sep 07 '13 at 12:01
  • `a=[1,2,3] b=[10,20]`. After this code finds 3 correct sums (23, 22, 21) you get no more "lookahead" positions because of going out-of-bounds, and only the (0,0) position for `nottakenindices`. So the fourth sum found by this code is 11, while 13 and 12 are lost. – Evgeny Kluev Sep 07 '13 at 12:12
  • @EvgenyKluev Thanks. I'll consider later. – Marcin Sep 07 '13 at 15:58
  • It's not enough to look 1 element behind only. Consider two arrays with `{5, 50, 70, 80, 90, 100}`: the first pair smaller than `50+50` is `90+5`. `O(k)` seems very optimistic IMHO. – vgru Sep 08 '13 at 06:40
  • @Groo Indeed. I need to rethink this. – Marcin Sep 08 '13 at 11:41
0

If the last two solutions were at (a1, b1) and (a2, b2), then it seems to me there are only four candidate solutions: (a1-1, b1), (a1, b1-1), (a2-1, b2), (a2, b2-1). This intuition could be wrong. Surely there are at most four candidates for each coordinate, and the next highest is among the 16 pairs (a in {a1,a2,a1-1,a2-1}, b in {b1,b2,b1-1,b2-1}). That's O(k).

(No it's not, still not sure whether that's possible.)

clwhisk
  • Looking one element behind is not enough IMHO. Consider two arrays with `{5, 50, 70, 80, 90, 100}`: the first pair smaller than `50+50` is `90+5`, for example. – vgru Sep 08 '13 at 07:04
  • Right. I think the trick is in only having to look O(k) elements behind, total. In this example 5+90 is the 26th highest pair and finding it requires more work than most of the others. I was having trouble capturing that line of reasoning and wrote this poor attempt! – clwhisk Sep 08 '13 at 17:54
  • What did you mean by looking `O(k)` elements behind? I don't see why the choice of `k` should affect the algorithm needed to order pairs. Did you mean, go all the way back to the largest element on each iteration? – vgru Sep 08 '13 at 20:54
  • Go back one element at a time until you've gone back further than, in this case, 45. Then is the total number times you need to go back (the sum of the count of these operations over finding all k highest sums) bounded by a multiple of k? – clwhisk Sep 08 '13 at 22:52
0

The max-heap algorithm in the other question is simple, fast and correct. Don't knock it. It's really well explained too: https://stackoverflow.com/a/5212618/284795

Maybe there isn't any O(k) algorithm. That's okay; O(k log k) is almost as fast.
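For readers who land here first, the linked approach looks roughly like this in Python (a minimal sketch of my own with a visited set to suppress duplicate pairs; details may differ from the linked answer):

```python
import heapq

def kth_largest_sum_heap(a, b, k):
    # a, b sorted ascending; assumes 1 <= k <= len(a) * len(b).
    # O(K log K): the heap holds at most O(K) candidate pairs.
    i, j = len(a) - 1, len(b) - 1
    heap = [(-(a[i] + b[j]), i, j)]   # max-heap via negated sums
    seen = {(i, j)}
    for _ in range(k):
        neg_sum, i, j = heapq.heappop(heap)
        # push the two successor pairs, skipping any already enqueued
        for ni, nj in ((i - 1, j), (i, j - 1)):
            if ni >= 0 and nj >= 0 and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-(a[ni] + b[nj]), ni, nj))
    return -neg_sum
```

On the question's example this yields 29, 25, 24, 21 for k = 1..4.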

Colonel Panic
  • Actually, I think the heap thing is O(k log n) -- where n is the length of the shorter of the two vectors. Since n is constant for any given pair of vectors, that's effectively O(k). See: [answer to later, related question](http://stackoverflow.com/a/25319309/3793679). –  Aug 17 '14 at 12:39
0
[2, 3, 5, 8, 13]
[4, 8, 12, 16]

Merge the 2 arrays and note down the indexes in the sorted array. Here is what the index arrays look like (starting from 1, not 0):

[1, 2, 4, 6, 8] [3, 5, 7, 9]

Now start from the end and make tuples. Sum the elements in each tuple and pick the kth largest sum.

0
// requires: import java.util.*; import java.util.stream.Collectors;
public static List<List<Integer>> optimization(int[] nums1, int[] nums2, int k) {
    // 2 * O(n log(n))
    Arrays.sort(nums1);
    Arrays.sort(nums2);
    List<List<Integer>> results = new ArrayList<>(k);
    int endIndex = 0;
    // Find the number whose square is the first one bigger than k
    for (int i = 1; i <= k; i++) {
        if (i * i >= k) {
            endIndex = i;
            break;
        }
    }
    // The following iteration provides at most endIndex^2 elements, and both arrays are in
    // ascending order, so the k smallest pairs can be found in this iteration. To flatten
    // the nested loop, refer to
    // 'https://stackoverflow.com/questions/7457879/algorithm-to-optimize-nested-loops'
    for (int i = 0; i < endIndex * endIndex; i++) {
        int m = i / endIndex;
        int n = i % endIndex;
        List<Integer> item = new ArrayList<>(2);
        item.add(nums1[m]);
        item.add(nums2[n]);
        results.add(item);
    }
    results.sort(Comparator.comparing(pair -> pair.get(0) + pair.get(1)));
    return results.stream().limit(k).collect(Collectors.toList());
}

Keys to eliminating O(n^2):

  1. Avoid the cartesian product (or 'cross join'-like operation) of both arrays, which means flattening the nested loop.

  2. Downsize the iteration over the 2 arrays.

So:

  1. Sort both arrays (Arrays.sort offers O(n log(n)) performance according to the Java docs).

  2. Limit the iteration range to a size that is just big enough to support searching for the k smallest pairs.

vince