16

I know this can be done by sorting the array and taking the largest numbers until the required condition is met. That would take at least n log(n) time for the sort.

Is there any improvement over n log(n)?

We can assume all numbers are positive.
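
For concreteness, the sorting baseline I have in mind looks roughly like this (a minimal Python sketch, taking "required condition" to mean the running sum reaches S):

def min_count_by_sorting(arr, s):
    # O(n log n) baseline: sort descending and take the largest values
    # until the running sum reaches s.
    total = 0
    for count, x in enumerate(sorted(arr, reverse=True), start=1):
        total += x
        if total >= s:
            return count
    return None  # the whole array does not reach s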

Shamim Hafiz - MSFT
  • @equality: Yes, I corrected my post. – Shamim Hafiz - MSFT May 11 '11 at 12:14
  • Was an O(n log(n)) algorithm insufficient to convince them that you are a good candidate? It sounds like an OK solution to me. – Simen S May 11 '11 at 12:20
  • Not necessarily. But not sure why it would matter? – Shamim Hafiz - MSFT May 11 '11 at 13:01
  • @equality: S doesn't have to be constant, it just has to be known. If it was constant or bounded you could improve real world performance by allocating buckets in advance, but this makes no difference to the theoretical complexity of the algorithm. – verdesmarald May 11 '11 at 13:13
  • @equality: Oh ok. Well the algorithm needs to run just one time. Therefore S is effectively constant. @veredesmarald: Agreed – Shamim Hafiz - MSFT May 11 '11 at 13:14
  • @equality: It's 50 and not a formula. I am not sure if the question is incorrectly described to make you think it's a function and not a constant value. – Shamim Hafiz - MSFT May 11 '11 at 13:18
  • @equality: When I said "known" I meant "known in advance of any computation". The reason there is a hard `n lg(n)` limit on comparison sorts is that you don't know in advance what range the values lie in, how many there are, etc. Knowing some of that information up front is what allows the non-comparative sorts to perform better in some instances. – verdesmarald May 11 '11 at 13:47
  • I think my solution should get accepted. It gives you an optimal answer without needing any kind of fast integer based sorting or making any assumptions on the input. – Rob Neuhaus May 12 '11 at 13:45

5 Answers

9

Here is an algorithm that is O(n + size(smallest subset) * log(n)). If the smallest subset is much smaller than the array, this will be O(n).

Read http://en.wikipedia.org/wiki/Heap_%28data_structure%29 if my description of the algorithm is unclear (it is light on details, but the details are all there).

  1. Turn the array into a max-heap, so the biggest element is available at the top. This takes O(n).
  2. Repeatedly extract the biggest element from the heap until the extracted elements' sum is large enough. This takes O(size(smallest subset) * log(n)).

This is almost certainly the answer they were hoping for, though not getting it shouldn't be a deal breaker.
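
A minimal sketch of this approach using Python's heapq (a min-heap, so values are negated to pop the largest first); the function name is just for illustration:

import heapq

def min_count_heap(arr, s):
    # Step 1: build a max-heap in O(n) by heapifying the negated values.
    heap = [-x for x in arr]
    heapq.heapify(heap)
    # Step 2: pop the largest elements until their sum reaches s.
    total, taken = 0, 0
    while heap and total < s:
        total += -heapq.heappop(heap)
        taken += 1
    return taken if total >= s else None  # None: even the whole array falls short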

Edit: Here is another variant that is often faster, but can be slower.

import heapq

def smallest_exceeding_count(arr, s):
    # Walk through elements until the sum of the first few exceeds s.
    heap, current_sum, i = [], 0, 0
    while i < len(arr) and current_sum <= s:
        heap.append(arr[i])
        current_sum += arr[i]
        i += 1
    # Heapify so the minimum of the kept elements is easy to find.
    heapq.heapify(heap)
    # Keep a remaining element only if it beats the current minimum,
    # then evict minima while the sum still exceeds s without them.
    for element in arr[i:]:
        if heap[0] < element:
            heapq.heappush(heap, element)
            current_sum += element
            while current_sum - heap[0] > s:
                current_sum -= heapq.heappop(heap)
    return len(heap)

If we get to reject most of the array without manipulating our heap, this can be up to twice as fast as the previous solution. But it is also possible to be slower, such as when the last element in the array happens to be bigger than S.

btilly
  • This is O(n + K log n) (and not O(n + K log K) as you claim), isn't it? +1 though. This is definitely the best answer so far. –  May 11 '11 at 15:06
  • @Moron: Oops, you're right. I had another one before that could hit the lower bound, but realized that it would usually be worse. I'll add that solution later. – btilly May 11 '11 at 15:39
  • @Moron: Second solution added. – btilly May 11 '11 at 16:04
  • +1, this is a much better solution. My kung-fu is weak when it comes to heaps; I didn't know constructing them could be linear time if you know all the elements already, thanks! – verdesmarald May 11 '11 at 23:17
7

Assuming the numbers are integers, you can improve upon the usual n lg(n) complexity of sorting because in this case we have the extra information that the values are between 0 and S (for our purposes, integers larger than S are the same as S).

Because the range of values is finite, you can use a non-comparative sorting algorithm such as Pigeonhole Sort or Radix Sort to go below n lg(n).

Note that these methods are dependent on some function of S, so if S gets large enough (and n stays small enough) you may be better off reverting to a comparative sort.
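
A minimal sketch of the bucket idea, assuming non-negative integers and treating every value >= S as S (any one of them alone already reaches S); this runs in O(n + S) time and space:

def min_count_pigeonhole(arr, s):
    # Count how many times each value occurs; values >= s share the top bucket.
    counts = [0] * (s + 1)
    for x in arr:
        counts[min(x, s)] += 1
    # Walk the buckets from the largest value down, taking elements greedily
    # until the running sum reaches s.
    total, taken = 0, 0
    for value in range(s, 0, -1):
        for _ in range(counts[value]):
            total += value
            taken += 1
            if total >= s:
                return taken
    return None  # all elements together do not reach s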

verdesmarald
  • I may be wrong, but if you did something like a pigeonhole sort, you could actually just perform the first piece of the sort (putting the data in the holes) and then pull from the end of the data, adding until you hit >= S; this saves you the step of putting the elements back into a sorted order. – pstrjds May 11 '11 at 12:25
  • You only _know_ the integers range from 0 to S if you assume fixed size integers. If you're using bigints, you're back to `n log n`. – hammar May 11 '11 at 12:33
  • @hammar: What do you mean? The maximum value of the datatype is not important, only the value of S; as I noted in my answer, **for the purposes of this question values larger than S are equivalent to S** since if there are any values >= S in the array the answer is 1. – verdesmarald May 11 '11 at 12:37
  • @veredesmarald: That argument only works if _S_ is of fixed size. – hammar May 11 '11 at 12:44
  • @hammar: I'm sorry but I don't understand what you mean. Obviously for sufficiently large S a comparative sort performs better (as I said originally), but what does the programmatic representation of the value of S have to do with anything? – verdesmarald May 11 '11 at 12:54
  • In other words: The pigeonhole approach will give you an algorithm which is O(n+S). This will in many cases be better than O(n log(n)). Particularly if S is not much larger than n. – Simen S May 11 '11 at 12:58
  • @veredesmarald: The size of _S_ is _log(S)_, so if your numbers are bounded above by _S_, you will need to do _log(S)_ passes if using radix sort. Now, if _S_ is of fixed size, _log(S)_ will be a constant and will go away, but if it's not you haven't lost the log factor. – hammar May 11 '11 at 13:25
  • @hammar: Did you read my whole answer? I already noted that all non-comparative sorts are dependent on some function of S. The thing to consider when deciding on an approach to a particular instance of this problem is **the relative size of S and n** not just the size of S. Even if S happens to be 2^35, you are still better off with a bucket-based sort if n happens to be 2^33. It has nothing to do with bigint or fixed size, **it's all about S:n** – verdesmarald May 11 '11 at 13:38
  • @veredesmarald, @hammer: I'm fairly sure that there is no case where this answer is better than the one that I gave. – btilly May 11 '11 at 15:05
6

Here is an O(n) expected time solution to the problem. It's somewhat like Moron's idea but we don't throw out the work that our selection algorithm did in each step, and we start trying from an item potentially in the middle rather than using the repeated doubling approach.

Alternatively: it's really just quickselect with a little additional bookkeeping for the remaining sum.

First, it's clear that if you had the elements in sorted order, you could just pick the largest items first until you exceed the desired sum. Our solution is going to be like that, except we'll try as hard as we can not to discover ordering information, because sorting is slow.

You want to be able to determine whether a given value is the cut-off: if including that value and everything greater than it meets or exceeds S, but removing it drops us below S, then we are golden.

Here is the code; I didn't test it for edge cases, but it gets the idea across.

import random

def solve(arr, s):
    # We could get rid of the worst-case O(n^2) behavior, which basically never
    # happens, by selecting the median here deterministically, but in practice
    # the constant factor on the algorithm would be much worse.
    p = random.choice(arr)
    rest = list(arr)
    rest.remove(p)                            # p itself goes in neither half
    left_arr = [x for x in rest if x <= p]    # elements at most p
    right_arr = [x for x in rest if x > p]    # elements strictly greater than p
    right_sum = sum(right_arr)
    if right_sum + p >= s:
        if right_sum < s:
            # solved it: p forms the cut-off
            return len(right_arr) + 1
        # took too much; at least we eliminated left_arr and p
        return solve(right_arr, s)
    else:
        # didn't take enough yet: take all of right_arr and p,
        # then recurse on left_arr for the remainder
        return len(right_arr) + 1 + solve(left_arr, s - right_sum - p)
Rob Neuhaus
  • +1 - but partitioning-around-random-pivot algorithms (quicksort etc) can have bad worst-case performance when the pivot always happens to be unbalanced. I'm not sure if that means O(n^2) or O(n log n) in this case. Also - the repeated summing (naively implemented) would break the performance claim - you need to keep track of how the sum changes over time, as the partitioning changes the array and as the upper/lower bounds are modified to overcome that issue. –  May 12 '11 at 04:52
  • Yeah, it's O(n^2) in the worst case. You can get rid of that factor and turn it into O(n) worst case by doing a deterministic median find and pivoting around that, but such a solution is basically always going to be slower in practice (embrace randomness!). I improved the pseudocode to not recompute sum(right_arr), but it's not necessary for the asymptotic behavior, only the constant factor. Once we eliminate a piece of the array, we never need to compute their sum, nor examine them again (when we discard left, we never take them, when we discard right, we take all of them). – Rob Neuhaus May 12 '11 at 13:23
  • OK - on the recomputing, I'll trust you - usually my by-intuition guesses are more reliable than my carefully (but never carefully enough) doing the math, but sadly, that "more reliable" is often still pretty unreliable. –  May 12 '11 at 14:54
  • Nice answer. I prefer this over mine. Also this can be made O(n `log(S)`) worst case without hurting the average case. We've got an array of integers in a limited range. Make every other pivot be on the midpoint of the integers in that range, not on values in the array. This is guaranteed to converge in `O(log(S))` steps. (Alternately you can do the deterministic median find any time the last 3 selects didn't reduce the search space enough. That makes it `O(n)` and in the average case won't hurt performance.) – btilly May 19 '11 at 22:54
5

One asymptotic improvement over Theta(n log n) is an O(n log K) time algorithm, where K is the required minimum number of elements.

Thus if K is constant, or say log n, this is asymptotically better than sorting. Of course, if K is n^epsilon, then this is no better than Theta(n log n).

The way to do this is to use selection algorithms, which can tell you the ith largest element in O(n) time.

Now search for K: start with i = 1 (just the largest element) and keep doubling i at each turn; once the i largest elements are enough, binary search for K within the last doubling interval.

At each step you find the ith largest element, compute the sum of the i largest elements, and check whether that sum reaches S.

This way you would do O(log K) runs of the selection algorithm (each of which is O(n)), for a total running time of O(n log K).
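
A sketch of that strategy, using numpy.partition as the O(n) selection step; the helper names here are just for illustration:

import numpy as np

def min_count_by_selection(arr, s):
    a = np.asarray(arr)
    n = len(a)

    def top_k_sum(k):
        # Sum of the k largest elements via an O(n) partial partition.
        return np.partition(a, n - k)[n - k:].sum()

    if n == 0 or top_k_sum(n) < s:
        return None  # even all n elements do not reach s
    # Double k until the k largest elements reach s: O(log K) probes.
    k = 1
    while top_k_sum(k) < s:
        k = min(2 * k, n)
    # The answer lies in (k // 2, k]; binary search for the smallest such k.
    lo, hi = k // 2 + 1, k
    while lo < hi:
        mid = (lo + hi) // 2
        if top_k_sum(mid) >= s:
            hi = mid
        else:
            lo = mid + 1
    return lo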

  • I would not have thought of this because I know the faster standard solution. But it is a clever answer. +1 – btilly May 11 '11 at 15:07
  • @btilly: Only theoretically :-) Your answer is best even practically. I think I knew the standard solution, but my sleepy mind must have blocked it off somehow! –  May 11 '11 at 15:13
0
  1. Check for numbers >= S: if you find one, a single element solves it and you are done.
  2. Pigeonhole-sort the remaining numbers (all < S).

Then sum elements from highest to lowest in the sorted order until you exceed S.

BiGYaN