7

I'm working on a sorting/ranking algorithm that works with quite a large number of items, and I need to implement the following efficiently to make it work:


There are two lists of numbers. They are equally long, about 100-500 thousand items each. From these I need to find the n-th biggest product between the lists, i.e. if you create a matrix where one list is across the top, the other is down the side, and each cell is the product of the number above it and the number beside it.

Example: The lists are A=[1, 3, 4] and B=[2, 2, 5]. Then the products are [2, 2, 5, 6, 6, 15, 8, 8, 20]. If I wanted the 3rd biggest from that it would be 8.

The naive solution would be to simply generate those products, sort them and then select the n-th biggest. But that is O(m^2 * log m^2), where m is the number of elements in the input lists, and that is just not fast enough.
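For reference, a rough Python version of that naive approach (just to illustrate the baseline, not something I can actually run at this scale):

```python
# Naive baseline: materialize all m^2 products, sort, and index.
# Far too slow (and memory-hungry) for lists of 100-500 thousand items.
def nth_biggest_naive(a, b, n):
    products = sorted((x * y for x in a for y in b), reverse=True)
    return products[n - 1]  # n is 1-based: n=1 is the biggest

# nth_biggest_naive([1, 3, 4], [2, 2, 5], 3)  # -> 8
```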

I think what I need is to first sort the two lists. That is O(m * log m). Then I know for sure that the biggest one is A[0]*B[0]. The second biggest one is either A[0]*B[1] or A[1]*B[0], ...

I feel like this could be done in O(f(n)) steps, independent of the size of the matrix. But I can't figure out an efficient way to do this part.


Edit: There was an answer that got deleted, which suggested remembering a position in the two sorted lists, then looking at A[a]*B[b+1] and A[a+1]*B[b], returning the bigger one and incrementing a or b. I was going to post this comment before it got deleted:

This won't work. Imagine two lists A=B=[3,2,1]. This gives you the matrix [9,6,3 ; 6,4,2 ; 3,2,1]. So you start at (0,0)=9, go to (0,1)=6, and then the choice is (0,2)=3 or (1,1)=4. However, this misses (1,0)=6, which is bigger than both. So you can't just look at the two neighbors; you have to backtrack.

Timmy
  • 73
  • 4
  • 2
    n is bounded to the range (0..m^2), so I don't think you can claim that any O(f(n)) is independent of the size of the matrix. – mbeckish May 17 '12 at 14:06
  • The matrix generated is known as the outer product between the two vectors. – tskuzzy May 17 '12 at 15:13
  • 1
    What is the range of your list values? If in practice the range is a lot smaller than the size of your lists, then an algorithm that is a function of the range size might work out better than an algorithm that is a function of the list size. – mbeckish May 17 '12 at 15:14
  • 3
    Look at the similar question for Kth sum: http://stackoverflow.com/questions/5212037/find-the-kth-largest-sum-in-two-arrays – MBo May 17 '12 at 15:37
  • Your samples of A and B are all sorted. Are we supposed to assume they are always sorted? – user unknown May 17 '12 at 21:33

3 Answers

4

I think it can be done in O(n log n + n log m). Here's a sketch of my algorithm, which I think will work. It's a little rough.

  1. Sort A descending. (takes O(m log m))
  2. Sort B descending. (takes O(m log m))
  3. Let s be min(m, n). (takes O(1))
  4. Create s lazy sequence iterators L[0] through L[s-1]. L[i] will iterate through the s values A[i]*B[0], A[i]*B[1], ..., A[i]*B[s-1]. (takes O(s))
  5. Put the iterators in a priority queue q. The iterators will be prioritized according to their current value. (takes O(s) because initially they are already in order)
  6. Pull n values from q. The last value pulled will be the desired result. When an iterator is pulled, it is re-inserted in q using its next value as the new priority. If the iterator has been exhausted, do not re-insert it. (takes O(n log s))

In all, this algorithm will take O(m log m + (s + n)log s), but s is equal to either m or n.
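Here is a rough Python sketch of the idea, with `heapq` standing in for the priority queue of iterators (it assumes the values are non-negative, so that the products along each row are non-increasing; the function name is just illustrative):

```python
import heapq

def nth_biggest(a, b, n):
    a = sorted(a, reverse=True)   # step 1
    b = sorted(b, reverse=True)   # step 2
    s = min(len(a), n)            # step 3
    # Steps 4-5: one "lazy iterator" per row i, yielding a[i]*b[0], a[i]*b[1], ...
    # Each iterator is represented as (negated current value, row, column);
    # heapq is a min-heap, so priorities are negated.
    heap = [(-a[i] * b[0], i, 0) for i in range(s)]
    heapq.heapify(heap)
    value = None
    for _ in range(n):            # step 6: pull n values
        neg, i, j = heapq.heappop(heap)
        value = -neg
        if j + 1 < s:             # re-insert the iterator with its next value
            heapq.heappush(heap, (-a[i] * b[j + 1], i, j + 1))
    return value

# nth_biggest([1, 3, 4], [2, 2, 5], 3)  # -> 8
```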

recursive
  • 83,943
  • 34
  • 151
  • 241
0

You don't need to sort the 500 000 elements to get the top 3.

Just take the first 3, put them in a SortedList, and iterate over the rest of the list, replacing the smallest of the 3 elements with the new value if that is higher, and re-sorting the resulting list.

Do this for both lists, and you'll end up with a 3×3 matrix, from which it should be easy to take the 3rd value.

Here is an implementation in Scala.

If we assume n is smaller than m, and A=[1, 3, 4] and B=[2, 2, 5], n=2:

You would take (3, 4) => sort them (4,3)
Then take (2,5) => sort them (5, 2)

You could now do a zipped search. Of course the biggest product now is 5*4. But the next one is either 4*2 or 5*3. For longer lists, you could keep in mind what the result of 4*2 was, and compare it only with the next product taken the other way. That way you would only calculate one product too many.
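A minimal Python sketch of the reduction part (leaving out the zipped search, and assuming non-negative values as in the example; names are illustrative):

```python
import heapq

def nth_biggest_reduced(a, b, n):
    # Keep only the top k = min(n, m) values of each list; for non-negative
    # values, the n-th biggest product of the full lists is the n-th biggest
    # product of this reduced k x k matrix.
    k = min(n, len(a))
    top_a = heapq.nlargest(k, a)   # no full sort of the 500 000 elements
    top_b = heapq.nlargest(k, b)
    products = sorted((x * y for x in top_a for y in top_b), reverse=True)
    return products[n - 1]

# nth_biggest_reduced([1, 3, 4], [2, 2, 5], 3)  # -> 8
```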

Community
  • 1
  • 1
user unknown
  • 35,537
  • 11
  • 75
  • 121
  • But I don't always need the 3rd one. It can be anything from 1 to m^2. If it's in the second half I can reverse the sort and find (m^2 - n)-th smallest. So worst case is getting the (250,000)^2 element, which is a lot. – Timmy May 17 '12 at 14:31
  • Searching 5x5 instead of 10x10 is still a great improvement. (2n*2n) is always 4n², compared to n*n=n² - at least an improvement of 75%. If n compared to m is smaller, the improvement is greater: (3n*3n)=9n² and so on. – user unknown May 17 '12 at 14:51
  • Still O(m^2) which is nowhere near good enough. O(m log m) is acceptable for the initial sort, after that it needs to get more clever. – Timmy May 17 '12 at 14:53
@Timmy: I haven't completely figured out what recursive's algorithm does, but he sorts A and B initially, while I would only sort `min (N, M-N)`, in the worst case A/2, B/2. – user unknown May 17 '12 at 21:36
0

I don't think there is an O(f(n)) algorithm that is independent of m.

But there is a relatively fast O(n log m) algorithm:

First, we sort the two arrays in decreasing order, so that A[0] ≥ A[1] ≥ ... ≥ A[m-1] and B[0] ≥ B[1] ≥ ... ≥ B[m-1]. (This is O(m log m), of course.)

Then we build a max-heap, whose elements are A[0]*B[0], A[0]*B[1], ... A[0]*B[m-1]. And we maintain a "pointer array" P[0], P[1], ... P[m-1]. P[i]=x means that B[i]*A[x] is in the heap currently. All the P[i] are zero initially.

In each iteration, we pop the max element from the heap, which is the next largest product. Say it comes from B[i]*A[P[i]] (we can record which B[i] each heap element comes from); we then move the corresponding pointer forward, P[i] += 1, and push the new B[i]*A[P[i]] into the heap. (If P[i] moves out of range (>= m), we simply push -inf into the heap.)

After the n-th iteration, we get the n-th largest product.

There are n iterations, and each one is O(log m).

Edit: add some details
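A rough Python version of the above (heapq is a min-heap, so the products are negated; it assumes non-negative values, so each row B[i]*A[0], B[i]*A[1], ... is non-increasing; instead of pushing -inf for an exhausted row, this version simply skips the push):

```python
import heapq

def nth_largest_product(A, B, n):
    A = sorted(A, reverse=True)
    B = sorted(B, reverse=True)
    m = len(A)
    # P[i] = x means B[i]*A[x] is currently in the heap; all start at 0.
    P = [0] * m
    heap = [(-B[i] * A[0], i) for i in range(m)]   # (negated product, row i)
    heapq.heapify(heap)
    value = None
    for _ in range(n):
        neg, i = heapq.heappop(heap)   # pop the current maximum product
        value = -neg
        P[i] += 1
        if P[i] < m:                   # push the next product from row i
            heapq.heappush(heap, (-B[i] * A[P[i]], i))
    return value

# nth_largest_product([1, 3, 4], [2, 2, 5], 3)  # -> 8
```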

RoBa
  • 418
  • 3
  • 6
  • I think you can consider O(n*logm) to be O(m^2*logm), which would be equivalent to just sorting the entire set of products. – mbeckish May 17 '12 at 15:16
  • @mbeckish Theoretically, you are right. But my solution is faster if n is not \Theta(m^2). So I think at least it's valuable in some cases. – RoBa May 17 '12 at 15:24
  • True, not every value of n will take the same number of steps to solve for a given m. But when talking about big-O, it is still considered O(m^2*logm). I think the OP is looking to beat O(m^2*logm) for any n. – mbeckish May 17 '12 at 15:27