
I recently attended an interview and was asked the following question:

You have a text file with various values, one per line. How do you find the one value that is present in the maximum number of lines/entries?

I gave a hashmap as a solution, with constant time complexity.

Then the interviewer changed the question: what if the file has 10 billion rows?

Now I can't use a hashmap, and I was unable to think of an approach. Can anybody suggest a way?

Is there any way to group the same items together?
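
For reference, this is roughly the hashmap approach I gave (just a sketch in Python; it assumes the distinct values and their counts fit in memory):

```python
from collections import Counter

def most_common_value(path):
    counts = Counter()                  # value -> number of lines it appears on
    with open(path) as f:
        for line in f:
            counts[line.rstrip('\n')] += 1
    return counts.most_common(1)[0]     # (value, count) pair
```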

user2565192

3 Answers


You could sort the file and then do one pass that only requires O(1) memory.
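
For illustration, a sketch of that counting pass in Python, assuming the file has already been sorted externally (e.g. by GNU `sort` or a merge sort over temporary files) so that equal values sit on consecutive lines:

```python
def most_frequent_in_sorted(path):
    # Single streaming pass over a sorted file: O(1) memory, since we only
    # track the current run and the best run seen so far.
    best_value, best_count = None, 0
    current_value, current_count = None, 0
    with open(path) as f:
        for line in f:
            value = line.rstrip('\n')
            if value == current_value:
                current_count += 1
            else:
                current_value, current_count = value, 1
            if current_count > best_count:
                best_value, best_count = current_value, current_count
    return best_value, best_count
```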

whrow
  • How do you sort a file with `O(1)` memory? – throwit Nov 18 '15 at 08:45
  • The pass requires O(1) memory. The sort can use however much is available, for example a merge sort can be implemented nicely with temporary files. – whrow Nov 18 '15 at 08:51

If the range of values is limited to 32-bit integers, a simple way would be to keep a 4 GiB array of 8-bit saturating counters.

  • After one pass, if only one counter hit 255, then that's the value repeated the most.
  • Otherwise, record the values for all the counters that saturated to 255.
  • Make another pass through the file, updating 64-bit counters only for the values that you recorded (ignore the others).

You could convert to longer counters on the fly to keep it a one-pass algorithm: treat 255 as a sentinel value meaning that the real count lives in a hashmap of value -> 64-bit counter.

You could use 4-bit saturating counters if 4 GiB is too much, but then more of your counters will saturate, and they'll be slower to update (although memory will still be the bottleneck, regardless of the extra instructions to shift/mask/recombine with the old value).
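
A rough sketch of the two-pass variant described above, in Python. The function name is illustrative, and it assumes every line is an unsigned 32-bit integer and that a 4 GiB allocation is acceptable:

```python
def most_frequent_u32(path):
    counters = bytearray(2 ** 32)   # 4 GiB dense array of 8-bit saturating counters
    saturated = set()               # values whose counter reached 255

    with open(path) as f:           # pass 1: saturating counts
        for line in f:
            v = int(line)
            c = counters[v]
            if c < 255:
                counters[v] = c + 1
                if c == 254:        # counter just saturated; remember the value
                    saturated.add(v)

    if not saturated:               # nothing hit 255, so the dense counts are exact
        best_count = max(counters)
        return counters.index(best_count), best_count

    # pass 2: exact (64-bit-scale) counts, but only for the saturated values;
    # any non-saturated value occurs at most 254 times, so it can't win.
    exact = dict.fromkeys(saturated, 0)
    with open(path) as f:
        for line in f:
            v = int(line)
            if v in exact:
                exact[v] += 1
    return max(exact.items(), key=lambda kv: kv[1])
```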

There's no point in using a multi-level approach (1-bit saturating counters, then 8-bit saturating counters, ...), because every level after the first has to be sparse (otherwise it adds nothing). The per-entry overhead of a sparse map, like a hash or tree, is going to dominate the size of the actual counter, so use as much memory as you can for the dense first level, then fall back to a hashmap of 64-bit counters for the second level.


If a dense array of counters isn't viable at all (e.g. long numbers)

Sort in batches while counting duplicates, then merge those batches. See for example my answer to Memory-constrained external sorting of strings, with duplicates combined&counted, on a critical server (billions of filenames) for suggestions about how to maximize efficiency when batching things up. That was targeted at strings, not integers, but approaches like Tries for counting duplicates on the fly with good space efficiency will work even better for strings of digits than for arbitrary strings. A Radix Trie (nodes can represent strings, not just characters) might be more trouble than it's worth for such a small alphabet.

In any case, if sorting, count duplicates until you've used as much memory as you have available before writing a batch. Every duplicate you find and count in the first pass is one that won't have to get merged later.
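
A sketch of that batching scheme, assuming values fit on single lines and contain no tabs; the batch size stands in for "as much memory as you have available", and the helper names are made up for illustration:

```python
import heapq
import tempfile
from collections import Counter
from itertools import groupby

def write_run(counts):
    """Write one batch as a sorted run of 'value<TAB>count' lines."""
    tmp = tempfile.NamedTemporaryFile('w', delete=False, suffix='.run')
    for value in sorted(counts):
        tmp.write(f"{value}\t{counts[value]}\n")
    tmp.close()
    return tmp.name

def parse(line):
    value, count = line.rstrip('\n').split('\t')
    return value, int(count)

def most_frequent_external(path, batch_lines=10_000_000):
    # Phase 1: count duplicates within each batch, then write the batch
    # out as a sorted, pre-counted run.
    runs, counts = [], Counter()
    with open(path) as f:
        for i, line in enumerate(f, 1):
            counts[line.rstrip('\n')] += 1
            if i % batch_lines == 0:
                runs.append(write_run(counts))
                counts = Counter()
    if counts:
        runs.append(write_run(counts))

    # Phase 2: k-way merge of the sorted runs. Equal values come out
    # adjacent, so their per-run counts can be summed with groupby.
    best_value, best_count = None, 0
    files = [open(r) for r in runs]
    merged = heapq.merge(*(map(parse, f) for f in files))
    for value, group in groupby(merged, key=lambda vc: vc[0]):
        total = sum(c for _, c in group)
        if total > best_count:
            best_value, best_count = value, total
    for f in files:
        f.close()
    return best_value, best_count
```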

Peter Cordes

Divide the values into several files according to their hash values, then use a hashmap for each file.

Also, the time complexity is O(n), not O(1).
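
A minimal sketch of that partitioning idea; the partition count and file names are illustrative, and it assumes each partition's distinct values fit in memory:

```python
from collections import Counter

def most_frequent_partitioned(path, num_parts=256):
    # Phase 1: scatter lines into partition files by hash of the value.
    # Every occurrence of a given value lands in the same partition.
    parts = [open(f"part_{i}.txt", "w") for i in range(num_parts)]
    with open(path) as f:
        for line in f:
            parts[hash(line) % num_parts].write(line)
    for p in parts:
        p.close()

    # Phase 2: count each partition independently with a hashmap and keep
    # the overall winner.
    best_value, best_count = None, 0
    for i in range(num_parts):
        counts = Counter()
        with open(f"part_{i}.txt") as f:
            for line in f:
                counts[line.rstrip('\n')] += 1
        if counts:
            value, count = counts.most_common(1)[0]
            if count > best_count:
                best_value, best_count = value, count
    return best_value, best_count
```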

throwit
  • You could bucket according to the first 2 or 3 digits, and not waste time hashing. – Peter Cordes Nov 18 '15 at 08:07
  • You don't understand the question. The question is: how to find the value which occurs the most times? For example, given `1, 2, 3, 4, 5, 1, 4, 1`, the answer should be `1`, because there are three `1`s. – BlackMamba Nov 18 '15 at 08:14
  • No, you read it right the first time. The value that's present in the maximum **number of lines**. I.e. the most-repeated value. – Peter Cordes Nov 18 '15 at 08:14