312

I recently attended an interview where I was asked "write a program to find 100 largest numbers out of an array of 1 billion numbers."

I was only able to give a brute-force solution, which was to sort the array in O(n log n) time and take the last 100 numbers.

Arrays.sort(array);

The interviewer was looking for a better time complexity; I tried a couple of other solutions but couldn't satisfy him. Is there a solution with a better time complexity?

Peter Cordes
userx
  • Bucketsort could be a hint – Regenschein Oct 07 '13 at 14:43
  • 73
    Maybe the problem is that it wasn't a _sorting_ question, but a _seeking_ one. – geomagas Oct 07 '13 at 14:51
  • 13
    As a technical note, sort may not be the best way to solve the problem, but I don't think it's brute force - I can think of a lot worse ways of doing it. – Bernhard Barker Oct 07 '13 at 15:29
  • 2
    Another brute force method would be to create a parallel array in which you store the position of each number in the "highest number" competition. You iterate the first element and assign a 1 to it. When you get to the 8701th one you iterate the previous 8700 and "update" their position: Add 1 to it if they are lower than it, and leave it as is otherwise (but in that case add one to the position of the current, 8701th, number). It is probably O(n^2). – Daniel Daranas Oct 07 '13 at 15:52
  • 2
    See https://en.wikipedia.org/wiki/Partial_sorting and https://en.wikipedia.org/wiki/Selection_algorithm – Bergi Oct 07 '13 at 15:57
  • 92
    I just thought of an even more stupid brute force method...Find all possible combinations of 100 elements from the 1 billion element array and see which of these combinations has the largest sum. – Shashank Oct 07 '13 at 15:57
  • 1
    You could also iterate through the array and copy its numbers into a map of sets, in which the key is the number of digits each original number has. Then you'd only need to iterate your map by key in reverse order and keep grabbing your numbers and counting them. At some point you'd reach past 100 numbers so you'd need to select only some of the numbers from the last set; say for example that the sets with more than 9 digits had given you 96 numbers, and the set of numbers with 8 digits contains 9 numbers: you only need 4 of them so you'd need to find them... by brute force, of course :) – Daniel Daranas Oct 07 '13 at 16:02
  • 2
    This last strategy has its binary counterpart, which is interesting because it could be applied without using extra space. Read the first bit of every number, according to the type in which it is stored. If there are more than 100 `1`, keep all those numbers and discard the ones with `0`; otherwise you already have some winners (say, 63) and you need to keep iterating to find the remaining 37 numbers. You'll do that by looking at the second bit. You'll sweep the numbers left to right, so that you can directly pick the ones with the most leftmost `1`'s. – Daniel Daranas Oct 07 '13 at 16:10
  • 11
    Note that *all* deterministic (and correct) algorithms are `O(1)` in this case, because there is no dimension increase. The interviewer should have asked "How to find m biggest elements from an array of n with n >> m?". – Bakuriu Oct 07 '13 at 16:44
  • @Bakuriu Yes, we were assuming that n was the one billion just by context. The confusion of concepts that the interviewer had is rather common, from my experience. – Daniel Daranas Oct 07 '13 at 16:46
  • 2
    I may be crazy, but couldn't you use a variation on a radix MSD sort to make this a O(n) algorithm? – MirroredFate Oct 07 '13 at 17:50
  • 1
    See also: [Get 100 highest numbers from an infinite list](http://programmers.stackexchange.com/q/116346/38614) – Briguy37 Oct 07 '13 at 18:17
  • Wow, how come this question can get 59 up votes and the best answer 58 upvotes while this question is only 16 hours old? – justhalf Oct 08 '13 at 07:20
  • @justhalf It's not that uncommon that _some_ question has more than 50 upvotes after one day. They are a minority, but you can often find one of them. – Daniel Daranas Oct 08 '13 at 08:17
  • 1
    This question shows research effort; I think I will upvote it. 79 others can't be wrong, after all. – Kaz Oct 08 '13 at 19:26
  • Also similar to [How can I sort 1 million numbers, and only print the top 10 in Python?](http://stackoverflow.com/q/9236387/683200) – pepsi Oct 08 '13 at 20:51
  • @justhalf Also, it was featured in the stackoverflow newsletter. That's how I got to it, and that's how it got my upvote. – Ben Lee Oct 08 '13 at 21:08
  • I have found that using a quicksort is very effective with large number arrays – Keagan Ladds Oct 09 '13 at 05:52
  • I think just iterating through each number in the large list and removing numbers from the top eventually becomes more efficient than sorting if `m` stays constant and `n` increases... – Joe Z. Oct 09 '13 at 09:41
  • [Order statistics](http://en.wikipedia.org/wiki/Order_statistic). – Terry Li Oct 10 '13 at 04:15
  • This seems to be a problem of order statistics... find the 100th smallest number say N in the list and then just traverse the array once to select all the numbers lesser than the N. For more check Erik's lecture 6 (MIT analysis of algorithms ) . – dharam Oct 14 '13 at 12:40
  • I think we can simply get it in O(n) . We can use bubble sort to get 100 biggest elements using the following code – Shubham Sharma Jun 11 '15 at 11:18
  • 5
    Possible duplicate of [Retrieving the top 100 numbers from one hundred million of numbers](http://stackoverflow.com/questions/2550624/retrieving-the-top-100-numbers-from-one-hundred-million-of-numbers) – Adrian McCarthy Dec 15 '16 at 00:29

33 Answers

338

You can keep a priority queue of the 100 biggest numbers and iterate through the 1 billion numbers. Whenever you encounter a number greater than the smallest number in the queue (the head of the queue), remove the head of the queue and add the new number to the queue.

A priority queue implemented with a heap has insert + delete complexity of O(log K) (where K = 100 is the number of elements to find and N = 1 billion is the total number of elements in the array).

In the worst case you get billion*log2(100) comparisons, which is better than the billion*log2(billion) of an O(N log N) comparison-based sort (see footnote 1).

In general, if you need the largest K numbers from a set of N numbers, the complexity is O(N log K) rather than O(N log N); this can be very significant when K is very small compared to N.
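
For reference, here is a minimal Java sketch of this approach (the class and method names are mine, not from the answer). Java's PriorityQueue is a binary min-heap, so peek() always returns the smallest of the candidates kept so far.

import java.util.PriorityQueue;

class TopK {
    static PriorityQueue<Integer> largestK(int[] numbers, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();   // min-heap
        for (int x : numbers) {
            if (heap.size() < k) {
                heap.offer(x);                 // first K elements: just insert
            } else if (x > heap.peek()) {      // bigger than the smallest kept?
                heap.poll();                   // drop the current smallest
                heap.offer(x);                 // keep the new candidate
            }
        }
        return heap;   // the K largest values, as a min-heap
    }
}

Calling largestK(array, 100) makes a single pass over the billion elements; the O(log K) heap work only happens when a value actually displaces one of the current 100 candidates.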


The expected time of this priority queue algorithm is pretty interesting, since in each iteration an insertion may or may not occur.

The probability that the i-th number gets inserted into the queue is the probability that a random variable is larger than at least i-K random variables from the same distribution (the first K numbers are automatically added to the queue). We can use order statistics (see link) to calculate this probability.

For example, let's assume the numbers were drawn uniformly at random from [0, 1]. The expected value of the (i-K)-th smallest of i such numbers is about (i-K)/i, and the chance of a new random variable being larger than this value is 1 - [(i-K)/i] = K/i.

Thus, the expected number of insertions is:

sum_{i=K+1..N} K/i  =  K/(K+1) + K/(K+2) + ... + K/N  ~  K * (ln(N) - ln(K))

And the expected running time can be expressed as:

K + (N - K) + (log(K)/2) * sum_{i=K+1..N} K/i

(K time to build the queue with the first K elements, then N-K comparisons, and the expected number of insertions as described above, each taking an average of log(K)/2 time)

Note that when N is very large compared to K, this expression is a lot closer to N than to N log K. This is somewhat intuitive: as in the case of the question, even after 10,000 iterations (which is very small compared to a billion), the chance of a number being inserted into the queue is very small.

But we don't know that the array values are uniformly distributed. They might trend towards increasing, in which case most or all numbers will be new candidates for the set of 100 largest numbers seen. The worst case for this algorithm is O(N log K).

Or if they trend towards decreasing, most of the largest 100 numbers will be very early, and our best-case run time is essentially O(N + K log K), which is just O(N) for K much smaller than N.


Footnote 1: O(N) integer sorting / histogramming

Counting Sort or Radix Sort are both O(N), but often have larger constant factors that make them worse than comparison sorts in practice. In some special cases they're actually quite fast, primarily for narrow integer types.

For example, Counting Sort does well if the numbers are small. 16-bit numbers would only need an array of 2^16 counters. And instead of actually expanding back into a sorted array, you could just scan the histogram you build as part of Counting Sort.

After histogramming an array, you can quickly answer queries for any order statistic (e.g. the 99 largest numbers, or the 100th through 200th largest numbers). 32-bit numbers would scatter the counts over a much larger array or hash table of counters, potentially needing 16 GiB of memory (4 bytes for each of 2^32 counters), and on real CPUs probably incurring lots of TLB and cache misses, unlike an array of 2^16 elements where L2 cache would typically hit.

Similarly, Radix Sort could look at only the top buckets after a first pass. But the constant factors may still be larger than log K, depending on K.

Note that each counter needs to be large enough not to overflow even if all N integers are duplicates of the same value. 1 billion is somewhat below 2^30, so a 30-bit unsigned counter would be sufficient, and a 32-bit signed or unsigned integer is just fine.

If you had many more, you might need 64-bit counters, taking twice the memory footprint to initialize to zero and to randomly access. Or a sentinel value for the few counters that overflow a 16 or 32-bit integer, to indicate that the rest of the count is elsewhere (in a small dictionary such as a hash table mapping to 64-bit counters).

Peter Cordes
Ron Teller
  • 1
    Good idea, wrong type of container. You want a container that has good insertion at any point and good deletion from one end (the lowest value). A linked list might be better (especially if you re-use the removed node rather than delete / allocate) – Skizz Oct 07 '13 at 15:02
  • @Skizz A priority queue implemented with a heap would work fine. O(logN) insert and remove. – Dev Oct 07 '13 at 15:05
  • 6
    It is actually only *O(100)* for each insert. – MrSmith42 Oct 07 '13 at 15:10
  • 8
    @RonTeller You can't binary search a linked list efficiently, that is why a priority queue is usually implemented with a heap. Your insertion time as described is O(n) not O(logn). You had it right the first time (ordered queue or priority queue) until Skizz made you second guess yourself. – Dev Oct 07 '13 at 15:15
  • 2
    @RonTeller Maintaining the pointer array destroys the advantages of using a linked list with regards to add and remove time. There is no way around it. – Dev Oct 07 '13 at 15:22
  • 2
    @Dev Ok, got what you're saying, i'll update my answer, thanks – Ron Teller Oct 07 '13 at 15:25
  • Please also add, that n in the priority queue case (let's better call it k) is TINY and CONSTANT compared to the large list of integers. So the overall complexity is much more like O(n * log 100), which is a smooth O(n). Also this algorithm can be done online and doesn't require the whole dataset in memory. – Thomas Jungblut Oct 07 '13 at 15:29
  • @ThomasJungblut it's not O(n), as K doesn't have to be constant (k=n/2 for example), but i added an explanation – Ron Teller Oct 07 '13 at 15:33
  • @RonTeller if you look at the general top K integer case, then yes. But he asked about k=100 ;) – Thomas Jungblut Oct 07 '13 at 15:35
  • 19
    @ThomasJungblut billion is also a constant, so if that's the case it's O(1) :P – Ron Teller Oct 07 '13 at 15:35
  • @RonTeller it is about the relation between k and n. – Thomas Jungblut Oct 07 '13 at 15:52
  • 9
    @RonTeller: normally this kind of questions concerns thinks like finding 10 top pages from billions of Google search results, or 50 most frequent words for a word cloud, or 10 most popular songs on MTV, etc. So, I believe, in _normal circumstances_ it's safe to consider `k` _constant_ and _small_ compared to `n`. Though, one should always keep in mind this "normal circumstances". – ffriend Oct 07 '13 at 16:05
  • 2
    O(100) is nonsense. All O(constant) is O(1). – Kaz Oct 08 '13 at 19:27
  • @MrSmith42 that would be the worst case which would be 100*x and not O(100) and the average case expected the set to have a random distribution would be 50*x. However if the set of a billion values is sorted the average will be either 1 or 100 and since all those numbers (except the optimal case) depends on the size of the queue (k) it's a lot easier to just say O(k) – Rune FS Oct 08 '13 at 19:56
  • 2
    @Kaz: You are right `O(100)=O(constant)=O(1)`. It does not matter which constant you use, therefore I do not get what is 'nonsense' about `O(100)`. – MrSmith42 Oct 08 '13 at 20:03
  • @Rune FS: I wrote 'O(100) for each insert' which is about 100*x. So I do not get what you think was wrong with my comment. – MrSmith42 Oct 08 '13 at 20:10
  • 1
    "even after 10000 iterations (which is very small comparing to a billion), the chance of a number to be inserted to the queue is very small." -- unless, that is, the input array happens to already be sorted. – Ben Lee Oct 08 '13 at 21:12
  • Keep a sorted list of the top 100 numbers you have seen, but you care most about the smallest of those 100, as that is the one that falls off the end. Scan the source set, and you only insert into the set of 100 when you have something bigger than the smallest. – ChuckCottrill Oct 09 '13 at 02:27
  • @RonTeller: Your calculation of runtime is overcomplicated and probably wrong. Insertion time for the first k elements is linear since you can make a heap in linear time. (And anyway, log k! ≅ klogk by Stirling's approximation.) The number of rejected elements after that is less than n-k, and the expected number of accepted elements is k(digamma(n) - digamma(k) < klog(n). Therefore, the total average time is bounded by n + klog(n). – Neil G Oct 09 '13 at 03:17
  • 5
    Since you have 1G items, sample 1000 elements randomly, and pick the largest 100. That should avoid the degenerate cases (sorted, reverse sorted, mostly sorted), reducing the number of inserts considerably. – ChuckCottrill Oct 09 '13 at 03:17
  • @NeilG (a) it's not just a heap, it's a priority queue, you can't generate any sorted data structure in O(n). (b) why use an approximation when I can write the actual expression? it's easier to understand. (c) the calculation gives a close assessment of the expected running time, it's not a bound. – Ron Teller Oct 09 '13 at 06:19
  • @RonTeller: Sure, but the heap is the most efficient implementation of the priority queue for this application. b) your actual expression is an overestimate anyway because you assume log(k) steps to place the element in your priority queue, whereas on average 2 comparisons are required. c) I don't think it is that close and I have pointed out two overestimates. There should not be an n log(k) factor if you've calculated correctly, but rather a k log(n) (or digamma(n)). – Neil G Oct 09 '13 at 06:28
  • @NeilG The fact it's implemented with a heap doesn't make it a heap. I agree about the overestimation of the insertion time (which in the average case is log(k)/2, not 2). Anyway, the main idea of the calculation is to explain expected number of insertions. – Ron Teller Oct 09 '13 at 06:43
  • @RonTeller: yes, but you got that number wrong! :) It should be: http://latex.codecogs.com/png.latex?%5Cbegin%7Balign%7D%20k%20&plus;%20n%20-%20k%20&plus;%20%282-1%29%5Csum_%7Bi%3Dk&plus;1%7D%5En%20%5Cfrac%7Bk%7D%7Bi%7D%20%26%3D%20n%20&plus;%20k%28%5Cpsi%28n&plus;1%29%20-%20%5Cpsi%28k&plus;1%29%29%20%5C%5C%20%26%3C%20n%20&plus;%20k%5Clog%28n&plus;1%29%20%5Cend%7Balign%7D – Neil G Oct 09 '13 at 06:47
  • @NeilG No, I didn't, the expected number of insertions after the first k elements is a sum of `k/i` where `i=k+1...n`, just like in the calculation, and that **is** accurate. – Ron Teller Oct 09 '13 at 06:50
  • On second thought, you're right about the logarithmic time, but then why not simplify your equation using the digamma identity, taking i/i out of the summation, and replacing log(k!) with k? Then you'll get n + k(logn)(logk), nice and simple. – Neil G Oct 09 '13 at 07:14
  • @MrSmith42 there's nothing specific about 100, so using O(x) notation the convention would be O(1), so I expected your 100 to be significant for your statement. But seeing that 100 is the worst case for an insertion, I commented that 50 would be as good since that's the average case for a random set, but the best case is one comparison per insert. In other words it's not a constant – Rune FS Oct 09 '13 at 08:16
  • 1
    I think the assumption that the numbers are uniform is a big assumption. I would want to sample some of the numbers first to determine a threshold. I'm sure there's an optimal number, but I would sample about 1,000 of the 1e9 numbers and use them to initialize the priority queue. – geneorama Oct 09 '13 at 14:14
  • FYI: it's easy to see that sum_{i=k+1}^n 1/i < sum_{i=1}^n 1/i < log(n+1) by noting that the latter is the antiderivative of the continuous function (1/i) that is larger than the Riemann sum being taken. – Neil G Oct 09 '13 at 19:44
  • Also notice `sum_{i=k+1}^n 1/i = ln(n)-ln(k)+O(1)` by Euler–Mascheroni. – Thomas Ahle Mar 30 '14 at 10:06
  • Why can't we create a heap of billion elements and extract 100 largest elements. This way cost = O(billion) + 100*O(log(billion)) ?? – Mohit Shah Sep 13 '16 at 15:20
  • Lets say the head of the queue is 1000 and all the elements of the queue are from 1 to 99. Now the next number encountered is 500, then according to the above solution, since 500 is less than 1000 so it should not enter the queue? – Juvenik Feb 14 '19 at 03:16
144

If this is asked in an interview, the interviewer probably wants to see your problem solving process, not just your knowledge of algorithms.

The description is quite general, so maybe you can ask him about the range or meaning of these numbers to make the problem clear. Doing so may impress the interviewer. If, for example, these numbers stand for people's ages, then it's a much easier problem. With the reasonable assumption that nobody alive is older than 200, you can use an integer array of size 200 (maybe 201) to count the number of people with each age in just one iteration; here the index means the age. After this it's a piece of cake to find the 100 largest numbers. By the way, this algorithm is called counting sort.
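
For concreteness, here is a minimal sketch of that counting-sort idea in Java (the question's language), assuming the values really are ages in the range 0..200; all names are illustrative.

class Top100Ages {
    static int[] top100Ages(int[] ages) {
        final int K = 100, MAX_AGE = 200;
        int[] counts = new int[MAX_AGE + 1];       // index = age
        for (int age : ages) {
            counts[age]++;                         // one pass over the billion values
        }
        int[] result = new int[K];
        int out = 0;
        // Walk the counts from the oldest age downward until 100 values are collected.
        for (int age = MAX_AGE; age >= 0 && out < K; age--) {
            for (int c = counts[age]; c > 0 && out < K; c--) {
                result[out++] = age;
            }
        }
        return result;                             // the 100 largest, descending
    }
}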

Anyway, making the question more specific and clearer is good for you in an interview.

shim
jin
  • 30
    Very good points. Nobody else has asked or indicated anything about the distribution of those numbers - it could make all the difference in how to approach the problem. – NealB Oct 08 '13 at 19:29
  • 13
    I'd like this answer enough to extend it. Read the numbers one time through to get the min/max values so that you can assume distribution. Then, take one of two options. If the range is small enough, build an array where you can simply check off numbers as they occur. If the range is too large, use the sorted heap algorithm discussed above.... Just a thought. – Richard_G Oct 08 '13 at 21:07
  • 2
    I agree, asking question back to interviewer indeed makes a lot of difference. In fact, a question such as are you limited by compute power or not can also help you parallelize the solution by using multiple compute nodes. – Sumit Nigam Oct 09 '13 at 11:08
  • 1
    @R_G No need to go through the whole list. Enough to sample a small fraction (e.g., one million) of random members of the list to obtain useful statistics. – Itamar Oct 14 '13 at 08:08
  • For those who would not have thought about that solution, I'd recommend to read about the counting sort http://en.wikipedia.org/wiki/Counting_sort. That's actually a pretty common interview question: can you sort an array in better than O(nlogn). This question is just an extend. – Maxime Chéramy Nov 04 '13 at 17:46
71

You can iterate over the numbers, which takes O(n).

Whenever you find a value greater than the current minimum, add the new value to a circular queue with size 100.

The min of that circular queue is your new comparison value. Keep on adding to that queue. If full, extract the minimum from the queue.

jcolebrand
Regenschein
  • 3
    This doesn't work. e.g. find top 2 of {1, 100, 2, 99} will give {100,1} as the top 2. – Skizz Oct 07 '13 at 14:59
  • 7
    You cannot get around keeping the queue sorted (if you do not want to search the whole queue every time for the next smallest element) – MrSmith42 Oct 07 '13 at 15:12
  • 4
    @MrSmith42 Partial sorting, as in a heap, is sufficient. See Ron Teller’s answer. – Christopher Creutzig Oct 07 '13 at 20:35
  • 2
    Yes, I silently assumed that an extract-min-queue is implemented as a heap. – Regenschein Oct 08 '13 at 08:24
  • 2
    Instead of circular queue use min heap of size 100, this will have minimum of hundred number at top. This will take only O(log n) for insert as compared to o(n) in case of queue – techExplorer Oct 09 '13 at 05:06
  • How do you find the next min once the current min has been banished? – etayluz May 11 '16 at 20:15
  • A heap isn't a "circular queue" (aka ring buffer; https://en.wikipedia.org/wiki/Circular_buffer). A ring buffer is just a FIFO queue. A heap is a *priority* queue, which is what you need, and which has O(log K) complexity to extract the min (and to insert a new value and re-heapify, depending on implementation: https://en.wikipedia.org/wiki/Heap_(data_structure)#Comparison_of_theoretic_bounds_for_variants) – Peter Cordes Jul 20 '22 at 19:30
36

I realized that this is tagged with 'algorithm', but will toss out some other options, since it probably should also be tagged 'interview'.

What is the source of the 1 billion numbers? If it is a database then 'select value from table order by value desc limit 100' would do the job quite nicely - there might be dialect differences.

Is this a one-off, or something that will be repeated? If repeated, how frequently? If it is a one-off and the data are in a file, then 'cat srcfile | sort (options as needed, e.g. -rn for a reverse numeric sort) | head -100' will have you quickly doing productive work that you are getting paid to do while the computer handles this trivial chore.

If it is repeated, you would be well advised to pick any decent approach to get the initial answer and store/cache the results so that you can continuously report the top 100.

Finally, there is this consideration. Are you looking for an entry level job and interviewing with a geeky manager or future co-worker? If so, then you can toss out all manner of approaches describing the relative technical pros and cons. If you are looking for a more managerial job, then approach it like a manager would, concerned with the development and maintenance costs of the solution, say "thank you very much", and leave if the interviewer wants to focus on CS trivia. He and you would be unlikely to have much advancement potential there.

Better luck on the next interview.

Fred Mitchell
  • 3
    Exceptional answer. Everyone else has concentrated on the technical side of the question, while this response tackles the business social part of it. – vbocan Oct 09 '13 at 07:23
  • 3
    I never imagined you could say thank you and leave an interview and not wait for it to finish. Thanks for opening my mind. – UrsulRosu Oct 14 '13 at 06:49
  • 1
    Why can't we create a heap of billion elements and extract 100 largest elements. This way cost = O(billion) + 100*O(log(billion)) ?? – Mohit Shah Sep 13 '16 at 15:18
18

My immediate reaction for this would be to use a heap, but there is a way to use QuickSelect without keeping all of the input values on hand at any one time.

Create an array of size 200 and fill it up with the first 200 input values. Run QuickSelect and discard the low 100, leaving you with 100 free places. Read in the next 100 input values and run QuickSelect again. Continue until you have run through the entire input in batches of 100.

At the end you have the top 100 values. For N values you have run QuickSelect roughly N/100 times. Each QuickSelect costs about 200 times some constant, so the total cost is 2N times some constant. This looks linear in the size of the input to me, regardless of the parameter size that I am hardwiring to be 100 in this explanation.
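
Here is a rough Java sketch of this batching scheme, with K hardwired to 100 as in the description; the randomized QuickSelect partition is a standard one, and all names are mine rather than the answer's.

import java.util.Arrays;
import java.util.Random;

class BatchedQuickSelect {
    static final int K = 100;
    static final Random RND = new Random();

    // Returns the K largest values of 'input' (in no particular order).
    static int[] top100(int[] input) {
        int[] buf = new int[2 * K];
        int filled = 0;
        for (int x : input) {
            buf[filled++] = x;
            if (filled == buf.length) {
                filled = keepTopK(buf, filled);    // discard the low 100
            }
        }
        if (filled > K) {
            filled = keepTopK(buf, filled);        // final partial batch
        }
        return Arrays.copyOf(buf, Math.min(filled, K));
    }

    // Partition buf[0..len) so that its K largest values end up in buf[0..K).
    static int keepTopK(int[] buf, int len) {
        quickselect(buf, len, len - K);            // index len-K holds the K-th largest
        System.arraycopy(buf, len - K, buf, 0, K);
        return K;
    }

    // Classic randomized QuickSelect: puts the element of rank 'target'
    // (0-based, ascending) at index 'target'; everything to its right is >= it.
    static void quickselect(int[] a, int len, int target) {
        int lo = 0, hi = len - 1;
        while (lo < hi) {
            int pivot = a[lo + RND.nextInt(hi - lo + 1)];
            int i = lo, j = hi;
            while (i <= j) {
                while (a[i] < pivot) i++;
                while (a[j] > pivot) j--;
                if (i <= j) {
                    int t = a[i]; a[i] = a[j]; a[j] = t;
                    i++; j--;
                }
            }
            if (target <= j)      hi = j;
            else if (target >= i) lo = i;
            else return;          // the pivot landed exactly at 'target'
        }
    }
}

Each call to keepTopK touches only the 200-element buffer, so the scan over the billion numbers remains a single pass with O(1) extra memory.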

mcdowella
  • 11
    You can add a small but possibly important optimization: After running QuickSelect to partition the size 200 array, the minimum of the top 100 elements is known. Then, when iterating over the whole data set, only fill up the lower 100 values if the current value is greater than the current minimum. A simple implementation of this algorithm in C++ is on par with libstdc++'s `partial_sort` run directly on a data set of 200 million 32-bit `int` (created via a MT19937, uniformly distributed). – dyp Oct 07 '13 at 20:52
  • 1
    Good idea - doesn't affect the worst case analysis but looks well worth doing. – mcdowella Oct 08 '13 at 04:28
  • @mcdowella It's worth a try and I will do it, thanks! – userx Oct 08 '13 at 14:09
  • 9
    This is exactly what [Guava's](http://guava-libraries.googlecode.com) [`Ordering.greatestOf(Iterable, int)`](http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/collect/Ordering.html#greatestOf(java.util.Iterator,%20int)) does. It's absolutely linear-time and single-pass, and it's a super cute algorithm. FWIW, we also have some actual benchmarks: its constant factors are a hair slower than the traditional priority queue in the average case, but this implementation is much more resistant to "worst-case" input (e.g. strictly ascending input). – Louis Wasserman Oct 08 '13 at 20:42
16

You can use the QuickSelect algorithm to find the number at (order) index [billion-101] and then iterate over the numbers to find the numbers that are bigger than that number.

array={...the billion numbers...} 
result[100];

pivot=QuickSelect(array,billion-101);//O(N)

for(i=0;i<billion;i++)//O(N)
   if(array[i]>=pivot)
      result.add(array[i]);

This algorithm's time is 2 * O(N) = O(N) (average-case performance).

The second option, as Thomas Jungblut suggests, is:

Use a heap: building the max-heap will take O(N); then the top 100 numbers will be at the top of the heap, and all you need is to extract them from the heap (100 * O(log(N))).

This algorithm's time is O(N) + 100 * O(log(N)) = O(N).
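
A minimal Java sketch of that second option, assuming the values sit in an int[] that we are allowed to reorder in place; all names are illustrative.

class HeapifyTop100 {
    static int[] top100(int[] a) {
        int n = a.length;
        // Bottom-up heapify into a max-heap: O(N).
        for (int i = n / 2 - 1; i >= 0; i--) {
            siftDown(a, i, n);
        }
        int k = Math.min(100, n);
        int[] result = new int[k];
        // Pop the maximum k times: swap the root with the last leaf and
        // restore the heap over the shrunken range, 100 * O(log N) total.
        for (int out = 0; out < k; out++) {
            result[out] = a[0];
            a[0] = a[n - 1 - out];
            siftDown(a, 0, n - 1 - out);
        }
        return result;   // the 100 largest, in descending order
    }

    // Restore the max-heap property for the subtree rooted at 'root',
    // considering only a[0..size).
    static void siftDown(int[] a, int root, int size) {
        while (true) {
            int left = 2 * root + 1, right = left + 1, largest = root;
            if (left < size && a[left] > a[largest])   largest = left;
            if (right < size && a[right] > a[largest]) largest = right;
            if (largest == root) return;
            int t = a[root]; a[root] = a[largest]; a[largest] = t;
            root = largest;
        }
    }
}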

Community
One Man Crew
  • 8
    You are working through the entire list three times. 1 bio. integers are roughly 4gb, what would you do if you can't fit them into memory? quickselect is the worst possible choice in this case. Iterating once and keeping a heap of the top 100 items is IMHO the best performing solution in O(n) (note that you can cut off the O(log n) of heap inserts as n in the heap is 100 = constant = very tiny). – Thomas Jungblut Oct 07 '13 at 15:20
  • 3
    Even though it is still `O(N)`, doing two QuickSelects and another linear scan is way more overhead than needed. – Kevin Oct 07 '13 at 15:23
  • This is PSEUDO code; all the solutions here will take more time (O(N log N) or 100*O(N)). – One Man Crew Oct 07 '13 at 15:35
  • 1
    `100*O(N)` (if that's valid syntax) = `O(100*N)` = `O(N)` (admittedly 100 may be variable, if so, this is not strictly true). Oh, and [Quickselect has worst-case performance of O(N^2)](http://en.wikipedia.org/wiki/Quickselect) (ouch). And if it doesn't fit into memory, you'll be reloading the data from disk twice, which is a lot worse than once (this is the bottleneck). – Bernhard Barker Oct 07 '13 at 15:43
  • There is the issue that this is expected running time, and not worst case, but by using a decent pivot selection strategy (e.g. pick 21 elements at random, and choose the median of those 21 as pivot), then the number of comparisons can be guaranteed with high probability to be at most (2+c)n for an arbitrarily small constant c. – One Man Crew Oct 07 '13 at 15:50
  • @OneManCrew It's sort of remarkable that the consensus here is actually getting the problem wrong. By modifying the quickselect algorithm to choose a pivot with expected high rank, this could even be made to run in n(1+c)+o(n) comparisons for an arbitrarily small c. – mrip Oct 07 '13 at 16:15
  • Your complexity is wrong though. For the heap it is O(n * log k), where n is the number of integers, and k is the number of top items to find. That's a huge difference to 100 * O(n) whatever that should mean. – Thomas Jungblut Oct 07 '13 at 18:08
  • @Neil G still O(2Klog(N) + N) = O(N) and if k=100 in your case 200log(N) + N>> 100log(N) + N – One Man Crew Oct 07 '13 at 20:56
  • Yes, they're both linear, but yours is twice as slow on average, and quadratic worst case. Note that quickselect is 2N (made an error above) and 100log(N) + N << 2N. – Neil G Oct 07 '13 at 20:58
11

Although the other quickselect solution has been downvoted, the fact remains that quickselect will find the solution faster than using a queue of size 100. Quickselect has an expected running time of 2n + o(n), in terms of comparisons. A very simple implementation would be

array = input array of length n
r = Quickselect(array,n-100)
result = array of length 100
for(i = 1 to n)
  if(array[i]>r)
     add array[i] to result

This will take 3n + o(n) comparisons on average. Moreover, it can be made more efficient using the fact that quickselect will leave the largest 100 items in the array in the 100 right-most locations. So in fact, the running time can be improved to 2n+o(n).

There is the issue that this is expected running time, and not worst case, but by using a decent pivot selection strategy (e.g. pick 21 elements at random, and choose the median of those 21 as pivot), then the number of comparisons can be guaranteed with high probability to be at most (2+c)n for an arbitrarily small constant c.

In fact, by using an optimized sampling strategy (e.g. sample sqrt(n) elements at random, and choose the 99th percentile), the running time can be gotten down to (1+c)n + o(n) for arbitrarily small c (assuming that K, the number of elements to be selected is o(n)).

On the other hand, using a queue of size 100 will require O(log(100)n) comparisons, and log base 2 of 100 is approximately equal to 6.6.

If we think of this problem in the more abstract sense of choosing the largest K elements from an array of size N, where K=o(N) but both K and N go to infinity, then the running time of the quickselect version will be O(N) and the queue version will be O(N log K), so in this sense quickselect is also asymptotically superior.

In comments, it was mentioned that the queue solution will run in expected time N + K log N on a random input. Of course, the random input assumption is never valid unless the question states it explicitly. The queue solution could be made to traverse the array in a random order, but this will incur the additional cost of N calls to a random number generator as well as either permuting the entire input array or else allocating a new array of length N containing the random indices.

If the problem doesn't allow you to move around the elements in the original array, and the cost of allocating memory is high so duplicating the array is not an option, that is a different matter. But strictly in terms of running time, this is the best solution.

mrip
  • 4
    Your last paragraph is the key point: with a billion numbers, it's not feasible to hold all the data in memory or to swap elements around. (At least that's how I would interpret the problem, given that it was an interview question.) – Ted Hopp Oct 07 '13 at 16:07
  • 14
    In any algorithmic question, if reading the data is an issue, it must be mentioned in the question. The question states "given an array" not "given an array on disk that doesn't fit in memory and can't be manipulated according to the von neuman model which is the standard in analysis of algorithms". These days you can get a laptop with 8gigs of ram. I'm not sure where the idea of holding a billion numbers in memory is not feasible comes from. I have several billion numbers in memory on my workstation right now. – mrip Oct 07 '13 at 16:12
  • FYI Worst-case runtime of quickselect is O(n^2) (see http://en.wikipedia.org/wiki/Quickselect), and it also modifies the order of elements in the input array. It's possible to have a worst-case O(n) solution, with a very large constant (http://en.wikipedia.org/wiki/Median_of_medians). – pts Oct 07 '13 at 19:10
  • The worst case of quickselect is exponentially unlikely to happen, which means that for practical purposes this is irrelevant. It is easy to modify quickselect so that with high probability the number of comparisons is (2+c)n+o(n) for arbitrarily small c. – mrip Oct 07 '13 at 19:25
  • "the fact remains that quickselect will find the solution faster than using a queue of size 100" — Nope. The heap solution takes about N + Klog(N) comparisons versus 2N average for quickselect and 2.95 for Median of Medians. It is clearly faster for the given K. – Neil G Oct 08 '13 at 04:51
  • @NeilG I think you mean N + N log K. – mrip Oct 08 '13 at 11:09
  • @mrip: Nope, Klog(N). You need to estimate the number of elements accepted in the heap, which is actually K(digamma(N) - digamma(K)), which is roughly Klog(N). – Neil G Oct 08 '13 at 20:10
  • @NeilG Aha. Didn't think of that. Good point. In guess the queue is faster after all. A modified quickselect can be designed to run in n+o(n) (by choosing a very large pivot), but it looks the queue will probably still be faster, at least for K sufficiently small. – mrip Oct 08 '13 at 20:55
  • The fastest you can get quickselect (on average) to go is 2 comparisons per element. Consider that even if the pivots are the exact median, you are halving the size of the list each time. Then, you have sublist sizes of 1 + 1/2 + 1/4 + 1/8... until you find the desired order statistic. – Neil G Oct 08 '13 at 21:04
  • If you are searching for a very high rank element, you can get it faster. For example, pick a sample of size sqrt(n), sort it, and choose the element at the 99th percentile as your pivot. On average, you will end up with just 1% of the elements during the recursive call (assuming K/N<<0.01 and so it is safe to assume that the Kth element will be in the top part of the partition). – mrip Oct 08 '13 at 21:13
  • Nice idea. I wonder what the average number of comparisons is. I think the recurrence is roughly T(n) = T(sqrt(n)) + n + 0.5sqrt(n)log(n). What does the master method say? – Neil G Oct 08 '13 at 21:47
  • For K small enough, the number of comparisons can be gotten down tosomething like n+O(sqrt n log n), maybe even n + O(sqrt(n)). When there's just one (or a constant number) of sublinear recursive call, then you don't really need the master theorem. The calls shrink so quickly that only the first one shows up in the sum. All in all, it looks like the queue technique will run faster up until about K=sqrt(n) or maybe K=sqrt(n)log(n). For K>N^(1/2+c) I think a version of quickselect can be crafted to go faster. – mrip Oct 08 '13 at 21:54
  • Come to think of it, I'm not sure your square root solution works. For example, if you're trying to get the 99th percentile and you draw sqrt(n) numbers, and choose the 99th percentile of those, half the time you'll overshoot. In that case, when you partition the entire list of n numbers, you'll find that you have to search the larger partition containing 99 percent of the elements. I'm back to believing quickselect can't do better than 2n comparisons in the average case. If you're right that it can go faster, you should be able to find a reference. – Neil G Oct 09 '13 at 01:11
  • The assumption was that Y< – mrip Oct 09 '13 at 01:49
  • NICE. Looks like you're right. I wonder how many comparisons you need on average as a function of n and k. – Neil G Oct 09 '13 at 03:13
  • I thought of something else. The N+KlogN bound for using a queue assumes that the input is random, which is almost never a safe assumption. Of course, you could traverse the input in random order, but this requires N calls to a random number generator plus either rearranging the entire input array, or allocating a new array of indices of the size N. I think I'm back to thinking that quickselect is better for most values of K even if K – mrip Oct 09 '13 at 11:35
  • To "defend" against bad performance on non-uniform inputs, [gnasher's answer](https://stackoverflow.com/questions/19227698/write-a-program-to-find-100-largest-numbers-out-of-an-array-of-1-billion-numbers/36410401#36410401) suggests sampling some chunks scattered around to init your PQueue heap before scanning the array. If there's an increasing trend to the data, a chunk near the end is likely to have most of the K highest elements, for example. – Peter Cordes Jul 20 '22 at 20:57
5

Take the first 100 numbers of the billion and sort them. Now just iterate through the rest of the billion; if the source number is higher than the smallest of the 100, insert it in sort order (dropping the previous smallest). What you end up with is something much closer to O(n) over the size of the set.

  • 3
    oops didn't see the more detailed answer than my own. – Samuel Thurston Oct 07 '13 at 14:58
  • Take the first 500 or so numbers and only stop to sort (and throw out the low 400) when the list fills up. (And it goes without saying that you then only add to the list if the new number is > the lowest in the selected 100.) – Hot Licks Oct 08 '13 at 20:17
  • If it's uniformly distributed, yes about O(N). But worst-case `O(N * K)` if the array is increasing so you need to insert into a sorted list every time, where inserting costs on average K (like one step of an InsertionSort). This is why you use a Priority Queue with a heap, to make it `O(N * log K)`. – Peter Cordes Jul 20 '22 at 21:26
4

Two options:

(1) Heap (priorityQueue)

Maintain a min-heap with a size of 100. Traverse the array; whenever an element is larger than the first element in the heap (the minimum of the 100 kept so far), replace that minimum with it.

Insert an element into the heap: O(log 100)
Compare against the first element: O(1)
There are n elements in the array, so the total would be O(n log 100), which is O(n)

(2) Map-reduce model.

This is very similar to the word-count example in Hadoop. Map job: count every element's frequency (times appeared). Reduce: get the top K elements.

Usually, I would give the recruiter both answers and let them pick whichever they like. Of course, MapReduce coding would be laborious because you have to know all the exact parameters. There's no harm in practicing it. Good luck.

Chris Su
  • +1 for MapReduce, I can't believe you were the only one mentioning Hadoop for a billion numbers. What if the interviewer asked for 1k billion numbers? You deserve more up votes in my opinion. – Silviu Burcea Oct 22 '13 at 05:56
  • @Silviu Burcea Thanks a lot. I do value MapReduce too. :) – Chris Su Oct 23 '13 at 00:57
  • Although the size of 100 is constant in this example, you should really generalise this to a separate variable ie. k. As 100 is as constant as 1 billion, so why are you giving the size of the large set of numbers a size variable of n, and not for the smaller set of numbers? Really your complexity should be O(nlogk) which is not O(n). – Tom Heard Dec 30 '13 at 01:15
  • @TomHeard Yea. Indeed you are right from your point of view. I was answering the question. The question is find 100 largest numbers of 1 billion nodes. The time complexity is O(nlogk). Comparing 1 billion, 100 is trivial, so I can say it's O(n) because no matter what n is, k will always be 100. The increasing level or scope is trivial. – Chris Su Dec 31 '13 at 23:12
  • 2
    But my point is if you are just answering the question, 1 billion is also fixed in the question so why generalise 1 billion to n and not 100 to k. Following your logic, the complexity should actually be O(1) because both 1 billion and 100 are fixed in this question. – Tom Heard Jan 05 '14 at 22:49
  • 1
    @TomHeard All right. O(nlogk) There is only one factor which will affect the results. This means, if n is increasing larger and larger, the "result level" will increase linearly. Or we can say, even given trillion numbers, I can still get 100 largest numbers. However, you can't say: With increasing n, the k is increasing so that the k will affect the result. That's why I use O(nlogk) but not O(nlogn) – Chris Su Jan 09 '14 at 06:25
  • I never said it was O(nlogn), I have only ever said it should be O(nlogk) not O(n) (which is what you originally stated it was, which is wrong as 100 (k) is as variable as 1 billion (n) is in this question) – Tom Heard Jan 09 '14 at 22:31
4

A very easy solution would be to iterate through the array 100 times, which is O(n).

Each time, pull out the largest number and either change its value to the minimum possible value (so that you don't see it in the next iteration) or keep track of the indexes of previous answers (by tracking indexes, the original array can contain multiples of the same number). After 100 iterations, you have the 100 largest numbers.
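
A small Java sketch of the index-tracking variant described above; names are illustrative, and note that the flag array costs O(n) extra memory.

class RepeatedScanTop100 {
    static int[] top100(int[] a) {
        final int k = Math.min(100, a.length);
        int[] result = new int[k];
        boolean[] taken = new boolean[a.length];   // positions already selected
        for (int pass = 0; pass < k; pass++) {
            int bestIdx = -1;
            for (int i = 0; i < a.length; i++) {
                if (!taken[i] && (bestIdx == -1 || a[i] > a[bestIdx])) {
                    bestIdx = i;
                }
            }
            taken[bestIdx] = true;     // never pick this position again
            result[pass] = a[bestIdx];
        }
        return result;                 // the 100 largest, in descending order
    }
}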

James Oravec
  • 2
    Two disadvantages - (1) You're destroying the input in the process - this is preferably avoided. (2) You're going through the array multiple times - if the array is stored on disk and can't fit into memory, this could easily be almost 100 times slower than the accepted answer. (Yes, they're both O(n), but still) – Bernhard Barker Oct 09 '13 at 16:16
  • Good call @Dukeling, I added additional wording on how to avoid altering the original input by keeping track of previous answer indices. Which would still be pretty easy to code. – James Oravec Oct 09 '13 at 16:48
  • 1
    A brilliant example of a O (n) solution that is much slower than O (n log n). log2 (1 billion) is only 30... – gnasher729 Apr 04 '16 at 18:29
  • @gnasher729 How large is the constant hidden in O(n log n)? – miracle173 May 27 '17 at 21:59
3

The simple solution would be using a priority queue, adding the first 100 numbers to the queue and keeping track of the smallest number in the queue, then iterating through the other billion numbers; each time we find one that is larger than the smallest number in the priority queue, we remove that smallest number, add the new number, and again keep track of the smallest number in the queue.

If the numbers were in random order, this would work beautifully, because as we iterate through a billion random numbers, it would be very rare that the next number is among the 100 largest so far. But the numbers might not be random. If the array was already sorted in ascending order, then we would insert every element into the priority queue.

So we pick say 100,000 random numbers from the array first. To avoid random access which might be slow, we add say 400 random groups of 250 consecutive numbers. With that random selection, we can be quite sure that very few of the remaining numbers are in the top hundred, so the execution time will be very close to that of a simple loop comparing a billion numbers to some maximum value.
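
Here is a Java sketch of this priming idea, with two simplifications that are mine rather than the answer's: the 400 blocks of 250 consecutive numbers are spaced evenly instead of randomly (so the sampled positions are guaranteed distinct), and the sample is used only to establish a cutoff value before the single full pass.

import java.util.PriorityQueue;

class PrimedTop100 {
    static final int K = 100, BLOCKS = 400, BLOCK_LEN = 250;

    static PriorityQueue<Integer> top100(int[] a) {
        int threshold = Integer.MIN_VALUE;
        if (a.length >= BLOCKS * BLOCK_LEN) {
            // Phase 1: estimate the 100th-largest value from ~100,000 samples.
            PriorityQueue<Integer> sample = new PriorityQueue<>();
            int stride = a.length / BLOCKS;
            for (int b = 0; b < BLOCKS; b++) {
                int start = b * stride;
                for (int i = start; i < start + BLOCK_LEN; i++) {
                    offerCapped(sample, a[i]);
                }
            }
            threshold = sample.peek();   // 100th largest of the sample
        }

        // Phase 2: one pass over everything. The sample alone contains 100
        // values >= the threshold, so the true top 100 all pass the filter;
        // on trending (e.g. ascending) inputs most values are rejected with
        // a single comparison instead of a heap insertion.
        PriorityQueue<Integer> heap = new PriorityQueue<>();   // min-heap
        for (int x : a) {
            if (x >= threshold) {
                offerCapped(heap, x);
            }
        }
        return heap;   // the 100 largest values
    }

    // Keep at most K elements, always evicting the current minimum.
    private static void offerCapped(PriorityQueue<Integer> heap, int x) {
        if (heap.size() < K) {
            heap.offer(x);
        } else if (x > heap.peek()) {   // peek() is the smallest of the kept K
            heap.poll();
            heap.offer(x);
        }
    }
}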

gnasher729
2

This question would be answered with N log(100) complexity (instead of N log N) with just one line of C++ code.

 std::vector<int> myvector = ...; // Define your 1 billion numbers. 
                                 // Assumed integer just for concreteness 
 // std::greater is needed to put the *largest* 100 first; the default
 // comparator (std::less) would select the 100 smallest instead.
 std::partial_sort (myvector.begin(), myvector.begin()+100, myvector.end(),
                    std::greater<int>());

The final answer would be a vector whose first 100 elements are guaranteed to be the 100 biggest numbers of your array (sorted in descending order), while the remaining elements are unordered.

The C++ STL (standard library) is quite handy for this kind of problem.

Note: I am not saying that this is the optimal solution, but it would have saved your interview.

Vivian Miranda
1

The simplest solution is to scan the billion-number array and hold the 100 largest values found so far in a small array buffer, without any sorting, while remembering the smallest value in that buffer. First I thought this method was proposed by fordprefect, but in a comment he said that he assumed the 100-number data structure is implemented as a heap. Whenever a new number is found that is larger than the current minimum, that minimum in the buffer is overwritten by the new value and the buffer is searched for its new minimum. If the numbers in the billion-number array are randomly distributed, most of the time a value from the large array is compared to the minimum of the small array and discarded; only for a very small fraction of numbers must the value actually be inserted into the small array. So the cost of manipulating the data structure holding the small set of numbers can be neglected, and for such a small number of elements it is hard to say whether using a priority queue is actually faster than this naive approach.

I want to estimate the number of inserts into the small 100-element array buffer when the 10^9-element array is scanned. The program scans the first 1000 elements of the large array and has to insert at most 1000 elements into the buffer. The buffer then contains 100 of the 1000 elements scanned, i.e. a fraction of 0.1, so we assume that the probability that a value from the large array is larger than the current minimum of the buffer is about 0.1; such an element has to be inserted into the buffer.

Now the program scans the next 10^4 elements from the large array. The minimum of the buffer will increase every time a new element is inserted. We estimated that the ratio of elements larger than our current minimum is about 0.1, so there are about 0.1*10^4 = 1000 elements to insert (actually the expected number of elements inserted into the buffer will be smaller). After scanning these 10^4 elements, the fraction of the numbers in the buffer will be about 0.01 of the elements scanned so far. So when scanning the next 10^5 numbers we assume that not more than 0.01*10^5 = 1000 will be inserted into the buffer. Continuing this argument, we have inserted about 7000 values after scanning 1000+10^4+10^5+...+10^9 ~ 10^9 elements of the large array. So when scanning an array with 10^9 elements of random values, we expect not more than 10^4 (= 7000 rounded up) insertions into the buffer.

After each insertion into the buffer the new minimum must be found. If the buffer is a simple array, we need 100 comparisons to find the new minimum. If the buffer is another data structure (like a heap), we need at least 1 comparison to find the minimum. To compare the elements of the large array we need 10^9 comparisons. So all in all we need about 10^9 + 100*10^4 = 1.001 * 10^9 comparisons when using an array as the buffer, and at least 1.000 * 10^9 comparisons when using another type of data structure (like a heap). So using a heap brings only a gain of 0.1% if performance is determined by the number of comparisons. But what is the difference in execution time between inserting an element into a 100-element heap versus replacing an element in a 100-element array and finding its new minimum?

  • At the theoretical level: how many comparisons are needed for inserting into a heap? I know it is O(log(n)), but how large is the constant factor?

  • At the machine level: What is the impact of caching and branch prediction on the execution time of a heap insert and a linear search in an array.

  • At the implementation level: What additional costs are hidden in a heap data structure supplied by a library or a compiler?

I think these are some of the questions that have to be answered before one can try to estimate the real difference between the performance of a 100 element heap or a 100 element array. So it would make sense to make an experiment and measure the real performance.
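
To make such an experiment concrete, here is a minimal Java sketch of the array-buffer variant described above (names are mine): a plain 100-element array plus the index of its current minimum, with a roughly 100-comparison rescan only on the rare insertions.

import java.util.Arrays;

class ArrayBufferTop100 {
    static int[] top100(int[] a) {
        final int K = 100;
        int[] buf = new int[K];
        int filled = 0, minIdx = 0;
        for (int x : a) {
            if (filled < K) {
                buf[filled] = x;                   // fill the buffer first
                if (x < buf[minIdx]) minIdx = filled;
                filled++;
            } else if (x > buf[minIdx]) {
                buf[minIdx] = x;                   // overwrite the old minimum
                for (int i = 0; i < K; i++) {      // ~100 comparisons, but rare
                    if (buf[i] < buf[minIdx]) minIdx = i;
                }
            }
        }
        return Arrays.copyOf(buf, Math.min(filled, K));
    }
}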

miracle173
  • 1
    That's what a heap does. – Neil G Oct 08 '13 at 20:14
  • @Neil G: What "that"? – miracle173 Oct 08 '13 at 20:19
  • 1
    The top of the heap is the minimum element in the heap, and new elements are rejected with one comparison. – Neil G Oct 08 '13 at 21:01
  • yes but a heap does more: it allows the insertion/removing of an new element and finding the new minimum in O(log(n)) steps. I stated that a heap is not necessary because: 1)only a small number of elements will be inserted. 2) the O-Notation does not say anything about the size of the constant factor. Maybe the usage of a 100 element array has almost the same performance then maintaining a 100 element heap. – miracle173 Oct 09 '13 at 15:23
  • 1
    I understand what you're saying, but even if you go by absolute number of comparisons rather than asymptotic number of comparisons, the array is still much slower because the time to "insert new element, discard old minimum, and find new minimum" is 100 rather than about 7. – Neil G Oct 09 '13 at 19:40
  • @Neil G: I added some estimates for the performance so one can see that there is only a difference by 0,1% in the number of comparisons in this special situation. But even for the single insert of a new minimum it is not clear what performs better in reality for 100 elements. I added some arguments. I even don't believe that there are only about 7 comparisons in the worst case: if a new value is inserted at the root node there must be at least two comparison to decide in which subtree should be used for storeing the new value. so we have about 14 comparisons. – miracle173 Oct 10 '13 at 20:11
  • 1
    Okay, but your estimate is very roundabout. You can directly calculate the expected number of inserts to be k(digamma(n) - digamma(k)), which is less than klog(n). In any case, both the heap and the array solution spend only one comparison to discard an element. The only difference is the number of comparisons for an inserted element is 100 for your solution versus up to 14 for the heap (although the average case is probably much less.) – Neil G Oct 10 '13 at 20:53
1

I see a lot of O(N) discussions, so I propose something different just for the thought exercise.

Is there any known information about the nature of these numbers? If it's random in nature, then go no further and look at the other answers. You won't get any better results than they do.

However! See if whatever list-populating mechanism populated that list in a particular order. Are they in a well-defined pattern where you can know with certainty that the largest magnitude of numbers will be found in a certain region of the list or on a certain interval? There may be a pattern to it. If that is so, for example if they are guaranteed to be in some sort of normal distribution with the characteristic hump in the middle, always have repeating upward trends among defined subsets, have a prolonged spike at some time T in the middle of the data set like perhaps an incidence of insider trading or equipment failure, or maybe just have a "spike" every Nth number as in analysis of forces after a catastrophe, you can reduce the number of records you have to check significantly.

There's some food for thought anyway. Maybe this will help you give future interviewers a thoughtful answer. I know I would be impressed if someone asked me such a question in response to a problem like this - it would tell me that they are thinking of optimization. Just recognize that there may not always be a possibility to optimize.

djdanlib
1

Inspired by @ron teller's answer, here is a barebones C program to do what you want.

#include <stdlib.h>
#include <stdio.h>

#define TOTAL_NUMBERS 1000000000
#define N_TOP_NUMBERS 100

int 
compare_function(const void *first, const void *second)
{
    int a = *((int *) first);
    int b = *((int *) second);
    if (a > b){
        return 1;
    }
    if (a < b){
        return -1;
    }
    return 0;
}

int 
main(int argc, char ** argv)
{
    if(argc != 2){
        printf("please supply a path to a binary file containing 1000000000"
               "integers of this machine's wordlength and endianness\n");
        exit(1);
    }
    FILE * f = fopen(argv[1], "rb"); /* binary mode, for portability */
    if(!f){
        exit(1);
    }
    int top100[N_TOP_NUMBERS] = {0}; /* zero-init assumes the inputs are non-negative */
    int sorts = 0;
    for (int i = 0; i < TOTAL_NUMBERS; i++){
        int number;
        int ok;
        ok = fread(&number, sizeof(int), 1, f);
        if(!ok){
            printf("not enough numbers!\n");
            break;
        }
        if(number > top100[0]){
            sorts++;
            top100[0] = number;
            qsort(top100, N_TOP_NUMBERS, sizeof(int), compare_function);
        }

    }
    printf("%d sorts made\n"
    "the top 100 integers in %s are:\n",
    sorts, argv[1] );
    for (int i = 0; i < N_TOP_NUMBERS; i++){
        printf("%d\n", top100[i]);
    }
    fclose(f);
    exit(0);
}

On my machine (core i3 with a fast SSD) it takes 25 seconds, and 1724 sorts. I generated a binary file with dd if=/dev/urandom count=1000000000 bs=1 for this run.

Obviously, there are performance issues with reading only 4 bytes at a time from disk, but this is for example's sake. On the plus side, very little memory is needed.

1

You can do it in O(n) time. Just iterate through the list and keep track of the 100 biggest numbers you've seen at any given point, along with the minimum value in that group. When you find a new number bigger than the smallest of your 100, replace that smallest value and update the new minimum of the 100 (this may take a constant time of 100 comparisons each time you do it, but it does not affect the overall analysis).

James Oravec
1
Although in this question we should search for the top 100 numbers, I will generalize things and write x. Still, I will treat x as a constant value.

Algorithm Biggest x elements from n:

I will call the return value LIST. It is a set of x elements (in my opinion that should be a linked list).

  • The first x elements are taken from the pool "as they come" and sorted into LIST (this is done in constant time since x is treated as a constant: O( x log(x) ) time)
  • For every element that comes next, we check whether it is bigger than the smallest element in LIST; if it is, we pop out the smallest and insert the current element into LIST. Since LIST is ordered, every element finds its place in logarithmic time (binary search), and since it is an ordered list, insertion is not a problem. Every step is also done in constant time ( O(log(x)) time ).

So, what is the worst case scenario?

x log(x) + (n-x)(log(x)+1) = nlog(x) + n - x

So that is O(n) time for the worst case. The +1 is for checking whether a number is greater than the smallest one in LIST. The expected time for the average case will depend on the mathematical distribution of those n elements.

Possible improvements

This algorithm can be slightly improved for the worst-case scenario, but IMHO (I cannot prove this claim) that will degrade average behavior. The asymptotic behavior will be the same.

The improvement in this algorithm is that we will not check whether an element is greater than the smallest; for each element we will try to insert it, and if it is smaller than the smallest we will disregard it. Although that sounds preposterous, if we regard only the worst-case scenario we will have

x log(x) + (n-x)log(x) = nlog(x)

operations.

For this use case I don't see any further improvements. Yet you must ask yourself: what if I have to do this more than log(n) times, and for different x-es? Obviously we would sort that array in O(n log(n)) and take our x elements whenever we need them.

Rouz
1

Finding the top 100 out of a billion numbers is best done using a min-heap of 100 elements.

First, prime the min-heap with the first 100 numbers encountered. The min-heap will store the smallest of those 100 numbers at the root (top).

Now, as you go along the rest of the numbers, only compare them with the root (the smallest of the 100).

If the new number encountered is larger than the root of the min-heap, replace the root with that number; otherwise ignore it.

As part of inserting the new number into the min-heap, the smallest number in the heap will come to the top (root).

Once we have gone through all the numbers we will have the largest 100 numbers in the min-heap.

imsaar
0

I have written up a simple solution in Python in case anyone is interested. It uses the bisect module and a temporary return list which it keeps sorted. This is similar to a priority queue implementation.

import bisect

def kLargest(A, k):
    '''returns list of k largest integers in A'''
    ret = []
    for i, a in enumerate(A):
        # For first k elements, simply construct sorted temp list
        # It is treated similarly to a priority queue
        if i < k:
            bisect.insort(ret, a) # properly inserts a into sorted list ret
        # Iterate over rest of array
        # Replace and update return array when more optimal element is found
        else:
            if a > ret[0]:
                del ret[0] # pop min element off queue
                bisect.insort(ret, a) # properly inserts a into sorted list ret
    return ret

Usage with 100,000,000 elements and worst-case input which is a sorted list:

>>> from so import kLargest
>>> kLargest(range(100000000), 100)
[99999900, 99999901, 99999902, 99999903, 99999904, 99999905, 99999906, 99999907,
 99999908, 99999909, 99999910, 99999911, 99999912, 99999913, 99999914, 99999915,
 99999916, 99999917, 99999918, 99999919, 99999920, 99999921, 99999922, 99999923,
 99999924, 99999925, 99999926, 99999927, 99999928, 99999929, 99999930, 99999931,
 99999932, 99999933, 99999934, 99999935, 99999936, 99999937, 99999938, 99999939,
 99999940, 99999941, 99999942, 99999943, 99999944, 99999945, 99999946, 99999947,
 99999948, 99999949, 99999950, 99999951, 99999952, 99999953, 99999954, 99999955,
 99999956, 99999957, 99999958, 99999959, 99999960, 99999961, 99999962, 99999963,
 99999964, 99999965, 99999966, 99999967, 99999968, 99999969, 99999970, 99999971,
 99999972, 99999973, 99999974, 99999975, 99999976, 99999977, 99999978, 99999979,
 99999980, 99999981, 99999982, 99999983, 99999984, 99999985, 99999986, 99999987,
 99999988, 99999989, 99999990, 99999991, 99999992, 99999993, 99999994, 99999995,
 99999996, 99999997, 99999998, 99999999]

It took about 40 seconds to calculate this for 100,000,000 elements so I'm scared to do it for 1 billion. To be fair though, I was feeding it the worst-case input (ironically an array that is already sorted).

Shashank
  • 13,713
  • 5
  • 37
  • 63
  • A heap costs `O(log K)` for delete + insert. An array costs `O(K)` to insert into an already-sorted array (like one pass of Insertion Sort), so that's still better than `O(K log K)` to fully sort without taking advantage of it already being sorted. `del ret[0]` probably copies the array, or in-place copies down by one; removing from the end might be cheaper. So yeah, with worst-case input that runs `insort` for every element, in Python, I'm not surprised it was slow. – Peter Cordes Jul 20 '22 at 21:35
0

I know this might get buried, but here is my idea for a variation on a radix MSD.

pseudo-code:

//billion is the array of 1 billion numbers
int[] billion = getMyBillionNumbers();
//this assumes these are 32-bit integers and we are using hex digits
int[][] mynums = new int[8][16];

for number in billion
    putInTop100Array(number)

function putInTop100Array(number){
    //basically if we got past all the digits successfully
    if(number == null)
        return true;
    msdIdx = getMsdIdx(number);
    msd = getMsd(number);
    //check if the idx above where we are is already full
    if(mynums[msdIdx][msd+1] > 99) {
        return false;
    } else if(putInTop100Array(removeMSD(number))){
        mynums[msdIdx][msd]++;
        //we've found 100 digits here, no need to keep looking below where we are
        if(mynums[msdIdx][msd] > 99){
           for(int i = 0; i < msd; i++){
              //making it 101 just so we can tell the difference
              //between numbers where we actually found 101, and 
              //where we just set it
              mynums[msdIdx][i] = 101;
           }
        }
        return true;
    }
    return false;
}

The function getMsdIdx(int num) would return the index of the most significant digit (non-zero). The function getMsd(int num) would return the most significant digit. The function removeMSD(int num) would remove the most significant digit from a number and return the number (or return null if there was nothing left after removing the most significant digit).

Once this is done, all that is left is traversing mynums to grab the top 100 digits. This would be something like:

int[] nums = new int[100];
int idx = 0;
for(int i = 7; i >= 0; i--){
    int timesAdded = 0;
    for(int j = 15; j >= 0 && timesAdded < 100; j--){
        for(int k = mynums[i][j]; k > 0; k--){
            nums[idx] += j;
            timesAdded++;
            idx++;
        }
    }
}

I should note that although the above looks like it has high time complexity, it will really only be around O(7*100).

A quick explanation of what this is trying to do: essentially, this system uses a 2D array indexed by the position of a digit within the number and by that digit's value, and uses each cell to keep track of how many numbers with that digit have been inserted. When a count of 100 has been reached, it closes off all "lower branches".

The time of this algorithm is something like O(billion*log(16)*7)+O(100). I could be wrong about that. Also it is very likely this needs debugging as it is kinda complex and I just wrote it off the top of my head.

EDIT: Downvotes without explanation are not helpful. If you think this answer is incorrect, please leave a comment why. Pretty sure that StackOverflow even tells you to do so when you downvote.

MirroredFate
  • 12,396
  • 14
  • 68
  • 100
  • Interesting idea to just count in one pass, instead of actually doing RadixSort and appending numbers to arrays. Yeah this looks plausible, although the heap-based priority queue method is probably better if K is small (like 100 here), and if the data is uniformly distributed (so new candidates are rare). – Peter Cordes Jul 20 '22 at 21:48
0
Time ~ O(100 * N)
Space ~ O(100 + N)
  1. Create an empty list of 100 empty slots

  2. For every number in the input list:

    • If the number is smaller than the first one, skip it

    • Otherwise replace the first one with this number

    • Then push the number up through adjacent swaps, until it is no larger than the next one

  3. Return the list (see the sketch after the note below)


Note: if log(input-list.size) + c < 100, then the optimal way is to sort the input-list, then take the top 100 items.
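
A minimal sketch of the swap-based list in Java (illustrative names; slots[0] is always the smallest of the current top 100, and "empty" slots start at Integer.MIN_VALUE):

import java.util.Arrays;

static int[] top100(int[] input) {
    int[] slots = new int[100];
    Arrays.fill(slots, Integer.MIN_VALUE);        // 100 empty slots
    for (int x : input) {
        if (x <= slots[0]) continue;              // smaller than the first one: skip
        slots[0] = x;                             // otherwise replace the first one with this number
        // push the number up through adjacent swaps until it is no larger than the next one
        for (int i = 0; i + 1 < slots.length && slots[i] > slots[i + 1]; i++) {
            int tmp = slots[i]; slots[i] = slots[i + 1]; slots[i + 1] = tmp;
        }
    }
    return slots;                                 // the top 100, in ascending order
}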

Khaled.K
  • 5,828
  • 1
  • 33
  • 51
0

Another O(n) algorithm -

The algorithm finds the largest 100 by elimination.

Consider all the input numbers in their binary representation and start from the most significant bit. Finding whether that bit is 1 can be done with a bitwise AND against an appropriate mask. If more than 100 of the remaining numbers have a 1 in this bit, eliminate the numbers that have a 0 there; if fewer than 100 have a 1, those are all guaranteed winners and the search continues among the 0s for the remaining slots. Then proceed with the next most significant bit of the remaining numbers. Keep a count of how many numbers remain after each elimination and proceed as long as this count is greater than 100.

The main bitwise operation can be done in parallel on GPUs.
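
A minimal sketch of this elimination in Java, assuming non-negative 32-bit values (illustrative names; a GPU version would do the per-bit partition in parallel):

import java.util.ArrayList;
import java.util.List;

static List<Integer> top100ByBits(int[] numbers) {
    List<Integer> selected = new ArrayList<>();        // guaranteed winners found so far
    List<Integer> candidates = new ArrayList<>();
    for (int n : numbers) candidates.add(n);
    int need = 100;
    for (int bit = 31; bit >= 0 && need > 0; bit--) {
        List<Integer> ones = new ArrayList<>(), zeros = new ArrayList<>();
        for (int n : candidates) {
            if ((n & (1 << bit)) != 0) ones.add(n); else zeros.add(n);
        }
        if (ones.size() >= need) {
            candidates = ones;                         // enough 1s: eliminate every number with a 0 here
        } else {
            selected.addAll(ones);                     // too few 1s: all of them are winners
            need -= ones.size();
            candidates = zeros;                        // keep searching among the 0s
        }
    }
    // any candidates still left after the last bit are identical values
    selected.addAll(candidates.subList(0, Math.min(need, candidates.size())));
    return selected;                                   // the 100 largest, unordered
}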

0

I would find out who had the time to put a billion numbers into an array and fire him. Must work for government. At least if you had a linked list you could insert a number into the middle without moving half a billion to make room. Even better, a B-tree allows for a binary search. Each comparison eliminates half of your total. A hash algorithm would let you populate the data structure like a checkerboard, but it is not so good for sparse data. As it is, your best bet is to have a solution array of 100 integers and keep track of the lowest number in your solution array so you can replace it when you come across a higher number in the original array. You would have to look at every element in the original array, assuming it is not sorted to begin with.

0

Managing a separate list is extra work and you have to move things around the whole list every time you find another replacement. Just qsort it and take the top 100.

Chris Fox
  • 641
  • 6
  • 10
  • -1 quicksort is O(n log n) which is exactly what the OP did and is asking to improve upon. You don't need to manage a separate list, only a list of 100 numbers. Your suggestion also has the unwelcome side effect of changing the original list, or copying it. That's 4GiB or so of memory, gone. –  Oct 09 '13 at 17:10
0
  1. Use nth-element to get the 100th largest element in O(n).
  2. Iterate a second time over the data and output every element that is greater than this specific element.

Please note that especially the second step is easy to compute in parallel! And it will also be efficient when you need the million biggest elements.
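
Since the question is about Java, which has no built-in nth-element, here is a minimal quickselect sketch of step 1 (a Hoare-style partition, expected O(n); illustrative names). In C++, std::nth_element does this directly. A convenient side effect of partitioning in place is that step 2 becomes trivial: the last 100 slots already hold the answer.

static void quickselect(int[] a, int k) {
    // Rearranges a so that a[k] holds the value it would have if a were fully
    // sorted ascending; every element after index k is then >= a[k].
    int lo = 0, hi = a.length - 1;
    while (lo < hi) {
        int pivot = a[lo + (hi - lo) / 2];
        int i = lo, j = hi;
        while (i <= j) {                       // Hoare-style partition around the pivot value
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) {
                int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                i++; j--;
            }
        }
        if (k <= j) hi = j;                    // the target index lies in the left part
        else if (k >= i) lo = i;               // the target index lies in the right part
        else break;                            // a[k] already equals the pivot: done
    }
}

// Usage: quickselect(a, a.length - 100);
// a[a.length - 100] is the 100th largest value, and
// a[a.length - 100 .. a.length - 1] are the 100 largest, in no particular order.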

math
  • 8,514
  • 10
  • 53
  • 61
0

It's a question from Google or some other industry giant. Maybe the following code is the right answer expected by your interviewer. The time cost and space cost depend on the maximum number in the input array. For a 32-bit int array input, the maximum space cost is 4 * 125M bytes and the time cost is about 5 * 1 billion operations.

public class TopNumber {
    public static void main(String[] args) {
        final int input[] = {2389,8922,3382,6982,5231,8934
                            ,4322,7922,6892,5224,4829,3829
                            ,6892,6872,4682,6723,8923,3492};
        //One int (4 bytes) holds flags for 32 = 2^5 values,
        //so the full bitmap is about 4 * 125M bytes:
        //int sort[] = new int[1 << (32 - 5)];
        //Allocate a small array for this local test
        int sort[] = new int[1000];
        //Set all bit to 0
        for(int index = 0; index < sort.length; index++){
            sort[index] = 0;
        }
        for(int number : input){
            sort[number >>> 5] |= (1 << (number % 32));
        }
        int topNum = 0;
        outer:
        for(int index = sort.length - 1; index >= 0; index--){
            if(0 != sort[index]){
                for(int bit = 31; bit >= 0; bit--){
                    if(0 != (sort[index] & (1 << bit))){
                        System.out.println((index << 5) + bit);
                        topNum++;
                        if(topNum >= 3){ // stop after the top 3 for this small local test (would be 100 for the real input)
                            break outer;
                        }
                    }
                }
            }
        }
    }
}
Dariusz
  • 21,561
  • 9
  • 74
  • 114
Su Xiang
  • 11
  • 2
  • The 1 billion input numbers are allowed to contain duplicates, so a bitmap won't work reliably. (This *would* work if duplicates weren't allowed.) You need at least 1 byte counters for a histogram, if you use saturating counts (that clamp at +127 or 255), since K=100 in this case. So that's 4GiB of counters, equal to the total size of the input array, and thus a lot of work to zero out and count in. (Cache misses and TLB misses.) – Peter Cordes Jul 26 '22 at 22:05
0

I did my own code, not sure if it's what the "interviewer" is looking for:

private static final int MAX = 100;

PriorityQueue<Integer> queue = new PriorityQueue<>(MAX);
for (int i = 0; i < array.length; i++)
{
    if (queue.size() < MAX)
    {
        // fill the heap with the first MAX elements
        queue.add(array[i]);
    }
    else if (queue.peek() < array[i])
    {
        // the new element beats the smallest of the current top MAX: replace it
        queue.poll();
        queue.add(array[i]);
    }
}
// queue now holds the MAX largest elements of array
Javier
  • 1,469
  • 2
  • 20
  • 38
0

Possible improvements.

If the file contains 1 billion numbers, reading it could take a really long time...

To improve on this you can:

  • Split the file into n parts, create n threads, make each thread look for the 100 biggest numbers in its part of the file (using the priority queue), and finally merge the 100 biggest numbers from all the threads' outputs.
  • Use a cluster to do such a task, with a solution like Hadoop. Here you can split the file even more and get the output quicker for a file of 1 billion (or 10^12) numbers.
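
A minimal in-memory sketch of the first option in Java, using a parallel stream over chunks of an array instead of threads reading parts of a file (the names and the chunk count are illustrative):

import java.util.List;
import java.util.PriorityQueue;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelTop100 {
    static final int K = 100;

    static PriorityQueue<Integer> top100Parallel(int[] data, int chunks) {
        int chunkSize = (data.length + chunks - 1) / chunks;
        // Each chunk independently builds a min-heap of its own 100 biggest numbers.
        List<PriorityQueue<Integer>> partials = IntStream.range(0, chunks).parallel()
                .mapToObj(c -> topK(data, c * chunkSize,
                                    Math.min(data.length, (c + 1) * chunkSize)))
                .collect(Collectors.toList());
        // Merge: keep the 100 biggest of all the per-chunk candidates.
        PriorityQueue<Integer> result = new PriorityQueue<>(K);
        for (PriorityQueue<Integer> p : partials) {
            for (int v : p) offer(result, v);
        }
        return result;
    }

    static PriorityQueue<Integer> topK(int[] data, int from, int to) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(K);
        for (int i = from; i < to; i++) offer(heap, data[i]);
        return heap;
    }

    static void offer(PriorityQueue<Integer> heap, int v) {
        if (heap.size() < K) heap.add(v);
        else if (heap.peek() < v) { heap.poll(); heap.add(v); }
    }
}
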
Maxime B.
  • 1,116
  • 8
  • 21
0

First take 1000 elements and add them to a max heap. Now take out the top 100 elements and store them somewhere. Now pick the next 900 elements from the file and add them to the heap along with the 100 highest elements you kept.

Keep repeating this process of picking up 100 elements from the heap and adding 900 elements from the file.

The final pick of 100 elements will give us the maximum 100 elements from the billion numbers.
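
A minimal sketch of this batching scheme in Java, using java.util.PriorityQueue with a reversed comparator as the max heap (illustrative names):

import java.util.Collections;
import java.util.PriorityQueue;

static int[] top100InBatches(int[] data) {
    PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Collections.reverseOrder());
    int[] best = new int[0];                      // the 100 best seen so far
    int i = 0;
    while (i < data.length) {
        for (int b : best) maxHeap.add(b);        // carry the previous winners into the batch
        int batch = (best.length == 0) ? 1000 : 900;   // 1000 the first time, then 900 + the kept 100
        for (int end = Math.min(data.length, i + batch); i < end; i++) {
            maxHeap.add(data[i]);                 // add the next batch from the input
        }
        int keep = Math.min(100, maxHeap.size());
        best = new int[keep];
        for (int k = 0; k < keep; k++) best[k] = maxHeap.poll();  // take the top 100 of this batch
        maxHeap.clear();                          // start the next batch fresh
    }
    return best;                                  // the 100 largest overall, in descending order
}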

Juvenik
  • 900
  • 1
  • 8
  • 26
-1

The complexity is O(N).

First create an array of 100 ints, initialize the first element of this array with the first element of the N values, and keep track of the index of the current element with another variable; call it CurrentBig.

Iterate through the N values:

if (N[i] > M[CurrentBig]) {

    M[CurrentBig] = N[i];  // overwrite the current value with the newly found larger number

    CurrentBig++;          // go to the next position in the M array

    CurrentBig %= 100;     // modulo arithmetic saves you from using lists/hashes etc.

    M[CurrentBig] = N[i];  // pick up the current value again to use it for the next iteration of the N array

}

When done, print the M array starting from CurrentBig, 100 times, wrapping around modulo 100. :-) For the student: make sure that the last line of the code does not overwrite valid data right before the code exits.

-1

This code finds the N largest numbers in an unsorted array.

#include <iostream>


using namespace std;

#define Array_Size 5 // No Of Largest Numbers To Find
#define BILLION 1000000000

void findLargest(int max[], int array[]);
int checkDup(int temp, int max[]);

int main() {


        int array[BILLION]; // contains data

        int i=0, temp;

        int max[Array_Size] = {}; // zero-initialize so the comparisons below are well-defined (assumes non-negative input)


        findLargest(max,array); 


        cout<< "The "<< Array_Size<< " largest numbers in the array are: \n";

        for(i=0; i< Array_Size; i++)
            cout<< max[i] << endl;

        return 0;
    }




void findLargest(int max[], int array[])
{
    int i,temp,res;

    for(int k=0; k< Array_Size; k++)
    {
           i=0;

        while(i < BILLION)
        {
            for(int j=0; j< Array_Size ; j++)
            {
                temp = array[i];

                 res= checkDup(temp,max);

                if(res == 0 && max[j] < temp)
                    max[j] = temp;
            }

            i++;
        }
    }
}


int checkDup(int temp, int max[])
{
    for(int i=0; i<Array_Size; i++)
    {
        if(max[i] == temp)
            return -1;
    }

    return 0;
}

This might not be the efficient one but gets the job done.

Hope this helps

Umer Farooq
  • 7,356
  • 7
  • 42
  • 67
  • "This might not be the efficient one but gets the job done." `std::nth_element(array, array+Array_Size, array+BILLION, std::greater{});` gets the job done as well (the element `array[Array_Size-1]` will contain the `Array_Size`th largest element, and all following elements will be smaller or equal). – dyp Oct 10 '13 at 08:16
  • Undefined: `N_O_L_N_T_F`. – greybeard Jul 21 '22 at 05:24
-2

Problem: Find m largest elements of n items where n >>> m

The simplest solution, which should be obvious to everyone, is to simply do m passes of the bubble sort algorithm,

then print out the last m elements of the array.

This requires no external data structures, and uses an algorithm that everyone knows.

The running time estimate is O(m*n). The best answers so far are O(n log(m)), so this solution is not significantly more expensive for small m.

I'm not saying this couldn't be improved, but this is by far the simplest solution.
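
A minimal sketch of the m-pass idea in Java (illustrative names):

static void bubbleTopM(int[] a, int m) {
    // Each pass pushes the largest remaining element of a[0 .. n-1-pass] to the
    // end of that range, so after m passes the m largest values sit in
    // a[a.length - m .. a.length - 1].
    for (int pass = 0; pass < m; pass++) {
        for (int i = 0; i < a.length - 1 - pass; i++) {
            if (a[i] > a[i + 1]) {
                int tmp = a[i]; a[i] = a[i + 1]; a[i + 1] = tmp;
            }
        }
    }
}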

Chris Cudmore
  • 29,793
  • 12
  • 57
  • 94
  • 2
    No external data structures? What about the billion number array to sort? An array of this size is a huge overhead in both time to fill and space to store. What if all the "big" numbers were at the wrong end of the array? You would need on the order of 100 billion swaps to "bubble" them into position - another large overhead... Finally, M*N = 100 billion vs M*Log2(N) = 6.64 billion which is nearly two orders of magnitude difference. Maybe re-think this one. A one pass scan while maintaining a data structure of the largest numbers is going to significantly out perform this approach. – NealB Oct 08 '13 at 16:34
-4

Recently I have been adopting a theory that all the problems in the world could be solved with O(1). And even this one. It wasn't clear from the question what the range of the numbers is. If the numbers range from 1 to 10, then the top 100 largest numbers will probably be a group of 10s. The chance that the highest possible value is picked out of the 1 billion numbers, when that value is very small compared to 1 billion, is very big. So I would give this as an answer in that interview.

Ilya Gazman
  • 31,250
  • 24
  • 137
  • 216