1

There are 1 billion integers stored in a file. One line per integer. Memory can support loading of 1 million integers at a time. We need to display 100 largest integers.

My thoughts :

  1. Use a max heap data structure of size 100.
  2. Take 1st million integers from the file and put in the heap.
Aditya Joshee
  • 137
  • 1
  • 3
  • 11
  • This is the time to learn about priority queue. – MBo Apr 21 '16 at 08:59
  • @MBo : I thought about priority queue. It is implemented in a heap data structure. I can have a max heap of the first 1 million number I read, but what next? – Aditya Joshee Apr 21 '16 at 09:05
  • 1
    See http://stackoverflow.com/questions/7746648/throwing-the-fattest-people-off-of-an-overloaded-airplane/7746745#7746745 for the general idea. – Jim Mischel Apr 21 '16 at 12:22
  • Possible duplicate of [Retrieving the top 100 numbers from one hundred million of numbers](http://stackoverflow.com/questions/2550624/retrieving-the-top-100-numbers-from-one-hundred-million-of-numbers) – Adrian McCarthy Dec 15 '16 at 00:28

2 Answers2

4

Build min-heap for the first 100 elements.

For every new element check - if it larger than heap top, remove top, insert new element.

Heap size always is 100. So overall complexity is O(N * log(100)) = O(N)
(in common case of k top values - O(N log k))

Million is used just as max size of block that you read from file, then walk through it.

MBo
  • 77,366
  • 5
  • 53
  • 86
2

You just need to iterate over the file once:

  • Have an ordered list of top 100 integers
  • Iterate over the file: If a number is big enough, put it in top 100

Edit: Inserting a new number into the top 100 is O(n) if you use a sorted list and O(log(n)) if you use a heap. So if the performance of the process depends on the insertion it makes sense to use a heap. If it depends mostly on reading the file then it doesn't matter.

fafl
  • 7,222
  • 3
  • 27
  • 50
  • You'll end up doing 100 billion comparisons that way. Should at least keep the structure in order so you can just compare the lowest entry. – Luke Joshua Park Apr 21 '16 at 09:01
  • Nah you only compare to the lowest element. – fafl Apr 21 '16 at 09:02
  • 1
    Provided it's an ordered structure, yeah. Just wrote that in my comment. – Luke Joshua Park Apr 21 '16 at 09:02
  • Why you want to have an ordered list? With min heap, we can get the lowest integer in O(1). We don't need to sort the list for just keeping the track of the current lowest integer. I think @MBo's solution is more efficient. – Aditya Joshee Apr 21 '16 at 10:19
  • Lowest integer is a[99]. How is that not O(1)? And you never need to sort the full list, just do every insert in the right place. – fafl Apr 21 '16 at 11:51
  • 1
    This is going to be very expensive if that "top 100" is a sorted list, because inserting into the list is O(n), where n is the number of items in the list. Sure, determining whether the item should go into the list is O(1), but then you have to iterate over the list to determine in which spot it will go.So the worst case becomes O(n * 100). – Jim Mischel Apr 21 '16 at 12:29
  • Actually inserting only takes O(log(n)) because you can do binary search. I still don't understand why a heap is better. – fafl Apr 21 '16 at 13:11
  • 1
    Ok it seems like finding the right spot to insert is O(log(n)), but actually inserting is O(n) because you have to shift all succeeding numbers. I'll edit my answer. – fafl Apr 21 '16 at 14:26