3

I would like to get the top X elements of an array that sum-up to at least a given sum without sorting the whole array beforehand in linear time. I think its not possible to get linear time in all cases but at least in my input arrays i have roughly 1% of the elements that make out 99% of the sum. And i need to identify those correctly. I don't know know if it helps but the sum of all elements is always 1.

I have already implemented it with a sorted array but that blows up the complexity of my algorithm. Afterwards I have already looked into the top-k algorithm and the knapsack algorithm but they do not allow a flexible x elements dependet on a given minimum sum.

Input Array: [0.1, 0.2, 0.4, 0.05, 0.01, 0.01, 0.01, 0.02, 0.15, 0.05]

Example 1:

Given Sum: 0.8

Expected output [0.1, 0.2, 0.4, 0.15, ] --> Sum 0.85 but only top 4 elements

Example 2: 

Given Sum: 0.95

Expected output [0.1, 0.2, 0.4, 0.15, 0.05, 0.05 ] --> Sum 0.95 but only top 6 elements

Really looking forward to your answers!

Rahul Agarwal
  • 4,034
  • 7
  • 27
  • 51
MadMat
  • 55
  • 3
  • How does sorting the array "blow up the complexity"? O(nlogn) is barely more than O(n). Certainly much less than knapsack. – tobias_k Jun 04 '19 at 09:18
  • Hi Tobias, you are right but the top-k algorithm is O(log n) right? So I wonder if i can do something without sorting the list beforehand. Any ideas, if this is possible in O(log n) or O(n) complexity? – MadMat Jun 04 '19 at 13:45
  • Not sure which top-k algorithm you are referring to specifically, but even finding the top-1 element in an unordered list can not be O(log n), as you basically have to look at each element at least once. – tobias_k Jun 04 '19 at 13:48
  • Yea you are right, was a misunderstanding from my side. In the Heap-based version you need O(n log k). My problem is that i do not now the k (which i named x) beforehand because it is dependent on the values in the array and the sum that i want to reach. I thought about something like: 1. Select elements until you reach the sum 2. Iterate trough the remaining array and whenever you detect an element that is bigger than an element in your top-x list put the element in and throw out other elements that sum up to your gain by adding the bigger element. – MadMat Jun 04 '19 at 14:05
  • I am not sure that I understand. Are both X and the sum givens? – RobertBaron Jun 04 '19 at 15:00
  • i think this can be done by modifying the (randomized) quick-select algorithm such that after partitioning you calculate the sum of the left and the right side and narrow down on the place you want to cut – Sopel Jun 04 '19 at 15:46
  • @Sopel isn't that what my answer describes? – גלעד ברקן Jun 04 '19 at 15:51

2 Answers2

2

If we can have a median selection algorithm with good enough likelihood that its time complexity is O(n), then we can have overall O(n). Observe that after selecting the median we only need to examine one of the parts in the partition, leading to N + N/2 + N/4... with a bound of O(n). This is because the wanted sum is either contained in the half above the median or we need to add more from the lower half, in which case we need not examine the upper half.

גלעד ברקן
  • 23,602
  • 3
  • 25
  • 61
  • @RobertBaron selection algorithms can be made to have a partition as byproduct. – גלעד ברקן Jun 04 '19 at 15:52
  • @RobertBaron even without any references, if we have to double the iteration on each part I mentioned in order to do the partitioning separately from the selection, it's still O(n) if the median finding algorithm is good enough since O(2*n) = O(n). – גלעד ברקן Jun 04 '19 at 15:58
  • @RobertBaron if we have a median, we partition into n/2 elements greater than it and n/2 elements less than it. Now either the sum is in the part with the larger elements, or we need to add some more lower elements, in which case we don't need to look at the part with the greater ones. (This assumes we sum the parts of course as part of the routine.) (by the way, I did not downvote your answer.) – גלעד ברקן Jun 04 '19 at 16:05
  • @RobertBaron [n/2 + n/4 + n/8... converges to n](https://en.wikipedia.org/wiki/1/2_%2B_1/4_%2B_1/8_%2B_1/16_%2B_%E2%8B%AF) – גלעד ברקן Jun 04 '19 at 16:17
  • @RobertBaron Robert, we only need to look at one part of each partition. We iterate over N, then over N/2, then over N/4, etc. until we have a solution. This sum converges to 2N. What's up? – גלעד ברקן Jun 04 '19 at 16:25
  • @RobertBaron this algorithm assumes we know the median each time (since we've selected it). If the selection algorithm hasn't already partitioned the section in question, we can partition it in one iteration over it by comparing each element to the median. Would a concrete example be helpful to you? – גלעד ברקן Jun 04 '19 at 16:29
  • @RobertBaron how do you get O(n log n)? I have n + n/2 + n/4 + n/8... which converges to 2n. By the way, [linear-time-median-finding](https://rcoh.me/posts/linear-time-median-finding/) – גלעד ברקן Jun 04 '19 at 16:44
  • @RobertBaron here's another good reference: https://stackoverflow.com/questions/10662013/finding-the-median-of-an-unsorted-array. (I'm still waiting to hear your explanation for how you arrive at O(n log n) total assuming O(n) median selection.) – גלעד ברקן Jun 04 '19 at 17:28
  • I think this is the right approach. Will post an update if I have final solution. Thanks everyone for contributing! – MadMat Jun 12 '19 at 16:27
0

You can round your values to say 3 decimal digits, and use the bucket sort. With 3 decimal digits, you will need 1000 buckets. You may use more or fewer buckets depending on your problem. The time complexity will be O(n+k) where k is number of buckets.

In your buckets, you can store the exact values, and so when scanning the buckets to get the desired sum, you would use the actual values. You said the top values usually represent 1% of all values. The top buckets should then contain only a few values.

RobertBaron
  • 2,817
  • 1
  • 12
  • 19
  • Hi Robert, hope you are well. Thanks for contributing. I think the quickselect idea is most relevant. Will update once final solution is there – MadMat Jun 12 '19 at 16:28