3

I want to sort 1 BILLION integers, and my system has just 1 GB of RAM. What would be the fastest and most efficient way to sort them?

  1. Say the input is a text file with one integer per line.

  2. We are using a Java program to sort.

  3. I have specified the RAM limit because we cannot hold all the input integers in memory at once.

Update: The integers are 7-digit numbers.

samarth
  • 3,866
  • 7
  • 45
  • 60
  • 1
    Possible duplicate: http://stackoverflow.com/questions/4127030/how-to-sort-million-billion-integers – Layke Aug 26 '11 at 10:10
  • 1
    See the comparison here: http://en.wikipedia.org/wiki/Sorting_algorithms#Comparison_of_algorithms. Take your pick! – Oliver Charlesworth Aug 26 '11 at 10:11
  • @Layke: the other question does not have space constraint. – amit Aug 26 '11 at 10:11
  • 1
    There is same kind of question answered http://stackoverflow.com/questions/2087469/sort-a-file-with-huge-volume-of-data-given-memory-constraint – Justin Aug 26 '11 at 10:12
  • how large are those integers? are duplicates possible? do you have external storage for temporary files? – Thilo Aug 26 '11 at 10:15
  • Thank you all, it was not as hard as I was thinking. – samarth Aug 26 '11 at 10:30
  • @samarth: If they told you about the seven digits, then they won't like your accepted answer about merge-sort using the disk. – Thilo Aug 26 '11 at 10:39

5 Answers

8

Integers are 7 digit numbers.

So there are only 10 million possible values.

You have 1GB of RAM. Make an array of counters, one for each possible value.

Read through the file once, count up the counters.

When done, output the numbers according to the final counter values.

Every number can occur at most 1 billion times, so a 32-bit counter is enough. That makes the array 10 million × 4 bytes = 40 MB.
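A minimal sketch of this counting approach in Java, assuming one integer per line as the question states (the file names are placeholders, not from the answer):

```java
import java.io.*;

// Counting sort over 7-digit numbers: 10^7 possible values, one 32-bit
// counter each, i.e. a 40 MB int[] that fits comfortably in 1 GB of RAM.
public class CountingSort {
    static final int MAX = 10_000_000; // 7-digit numbers: 0..9,999,999

    // Single pass over the input: bump one counter per line.
    static int[] countValues(BufferedReader in) throws IOException {
        int[] counts = new int[MAX];
        String line;
        while ((line = in.readLine()) != null) {
            counts[Integer.parseInt(line.trim())]++;
        }
        return counts;
    }

    // Walk the counters in order; emit each value as often as it was seen.
    static void writeSorted(int[] counts, PrintWriter out) {
        for (int value = 0; value < counts.length; value++) {
            for (int i = 0; i < counts[value]; i++) {
                out.println(value);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
             PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("sorted.txt")))) {
            writeSorted(countValues(in), out);
        }
    }
}
```

The input is read exactly once and never held in memory, which is what makes this work under the 1 GB limit.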

Thilo
  • 257,207
  • 101
  • 511
  • 656
3

The simplest thing to do is break the input into smaller files that fit in memory, sort each one, and then merge the results.

Guido van Rossum has a good description of doing this in Python; while it is obviously not the same language, the principle is the same.
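A sketch of the same split-sort-merge idea in Java (the chunk size, file names, and helper names here are illustrative, not part of the answer):

```java
import java.io.*;
import java.util.*;

// External merge sort: sort fixed-size chunks in RAM, write each as a
// sorted "run" to a temp file, then merge all runs in one sequential pass.
public class ExternalSort {

    // Read up to chunkSize integers at a time; sort and spill each batch.
    static List<File> splitIntoSortedRuns(BufferedReader in, int chunkSize) throws IOException {
        List<File> runs = new ArrayList<>();
        int[] buf = new int[chunkSize];
        String line;
        int n = 0;
        while ((line = in.readLine()) != null) {
            buf[n++] = Integer.parseInt(line.trim());
            if (n == chunkSize) {
                runs.add(writeRun(buf, n));
                n = 0;
            }
        }
        if (n > 0) runs.add(writeRun(buf, n)); // last, possibly short, chunk
        return runs;
    }

    static File writeRun(int[] buf, int n) throws IOException {
        Arrays.sort(buf, 0, n);
        File run = File.createTempFile("sort-run", ".txt");
        run.deleteOnExit();
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(run)))) {
            for (int i = 0; i < n; i++) out.println(buf[i]);
        }
        return run;
    }

    // One open reader per run, keyed by its next (smallest unread) value.
    static class Head {
        final BufferedReader reader;
        int value;
        Head(BufferedReader reader, int value) { this.reader = reader; this.value = value; }
    }

    // N-way merge: repeatedly emit the globally smallest head and refill it.
    static void merge(List<File> runs, PrintWriter out) throws IOException {
        PriorityQueue<Head> heads = new PriorityQueue<>(Comparator.comparingInt((Head h) -> h.value));
        for (File run : runs) {
            BufferedReader r = new BufferedReader(new FileReader(run));
            String line = r.readLine();
            if (line != null) heads.add(new Head(r, Integer.parseInt(line)));
            else r.close();
        }
        while (!heads.isEmpty()) {
            Head h = heads.poll();
            out.println(h.value);
            String line = h.reader.readLine();
            if (line != null) { h.value = Integer.parseInt(line); heads.add(h); }
            else h.reader.close();
        }
    }
}
```

Note this needs temporary disk space roughly the size of the input, which is the external-storage caveat raised in the comments below.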

Gareth Davis
  • 27,701
  • 12
  • 73
  • 106
  • Only works if external storage (or at least updating the input file in place) is allowed. – Thilo Aug 26 '11 at 10:29
  • 3
    If you have only 10 million possible values you can keep a count of them in memory with between 1.25 MB and 40 MB depending on how you do it. – Peter Lawrey Aug 26 '11 at 10:42
  • @Peter I know. But his third point contradicts his 1GB of RAM statement, apparently he can't fit them in memory... maybe they are very, very big integers. – Gareth Davis Aug 26 '11 at 10:53
  • 1
    @Gareth Davis - what he is saying (I think) is that he can't hold the *unsorted* list list in memory. – Stephen C Aug 26 '11 at 11:03
  • @Gareth, On the topic of contradictions, he appears to be talking about very big "7 digit numbers." ;) If there is a very large number of them, there have to be a lot of duplicates before there are too many to fit into memory. – Peter Lawrey Aug 26 '11 at 11:22
  • @Stephen C, "list list" is a flash back to "long long" ;) – Peter Lawrey Aug 26 '11 at 11:23
  • 2
    This is not the best solution if you are limited to 7 digit numbers, see the solution from Stephen C or Thilo. That solution will run in linear time instead of n log(n) time and it will also preserve memory. – Nicolas Bousquet Aug 26 '11 at 11:40
2

You specified that you are sorting a billion 7-digit (decimal) numbers.

If there were no duplicates, you could sort in memory with 10^7 BITS using radix sort. Since you must have duplicates (10^7 is less than 10^9), you could implement radix sort using (say) an array of 10^7 8-bit counters, with a HashMap<Integer, Integer> to deal with the relatively few cases where the counters overflow. Or just an array of 10^7 32-bit counters.
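The 8-bit-counter variant might look like this sketch (the class and method names are my own; the saturation threshold of 255 is just the natural limit of an unsigned byte):

```java
import java.util.HashMap;
import java.util.Map;

// 10^7 one-byte counters (10 MB instead of 40 MB), with a HashMap catching
// the presumably rare values seen more than 255 times.
public class ByteCounters {
    static final int MAX = 10_000_000;
    final byte[] counts = new byte[MAX];
    final Map<Integer, Integer> overflow = new HashMap<>();

    void add(int value) {
        if ((counts[value] & 0xFF) == 255) {
            // Counter saturated: route further occurrences to the map.
            overflow.merge(value, 1, Integer::sum);
        } else {
            counts[value]++;
        }
    }

    // Total occurrences = saturating byte counter + any overflow spill.
    long countOf(int value) {
        return (counts[value] & 0xFF) + (long) overflow.getOrDefault(value, 0);
    }
}
```

The `& 0xFF` masks are needed because Java bytes are signed; the trade-off versus plain 32-bit counters is 30 MB of RAM against a little extra bookkeeping.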

Another more general approach (that works for any kind of value) is to split the file into N smaller subfiles, sort each subfile in memory, and then perform an N-way merge of the sorted subfiles.

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
0

A BitSet covering all 4 billion possible int values occupies 512 MB. Just set a bit for each value you see and write the set bits out in order (they come out naturally sorted).

This only works if you don't care about duplicates.
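A sketch of that duplicate-discarding approach, sized here for the 7-digit case and reading from an in-memory array for brevity (for the full int range the BitSet grows to the 512 MB mentioned above):

```java
import java.util.BitSet;

// Set one bit per distinct value; iterating the set bits in ascending
// order yields the distinct values already sorted.
public class BitSetSort {
    static int[] distinctSorted(int[] values, int maxValue) {
        BitSet seen = new BitSet(maxValue);
        for (int v : values) seen.set(v); // duplicates collapse into one bit
        int[] out = new int[seen.cardinality()];
        int i = 0;
        // nextSetBit walks set bits in increasing index order.
        for (int v = seen.nextSetBit(0); v >= 0; v = seen.nextSetBit(v + 1)) {
            out[i++] = v;
        }
        return out;
    }
}
```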

If counting duplicates matters, I would still consider either a memory-mapped file for counting, or a merge sort of sorted subsections of the data. (I believe the latter is the expected answer.)

I recently bought a 24 GB PC for under £1K, so a few GB isn't that much unless you're limited by a hosted solution. (Or using a mobile device.)

Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
  • 1
    If you have 1 billion numbers from a set of 10 million possible values, at least one of them has to occur more than once. A BitSet won't be enough. – Thilo Aug 26 '11 at 10:35
  • It depends on whether you mean `int` or `Integer` when you say "integer". If that is so, you can only have 2^32 values. If you have more possible values, e.g. 10 billion, you can use a memory mapped file. I have written code which updates a custom bit set with 128 billion bits. – Peter Lawrey Aug 26 '11 at 10:39
  • If they only are 7 digit numbers you only have 10 million values with requires 10 million bits or about 1.25 MB of memory. Instead of bit you can use 32-bit counts and still only use 40 MB of memory. – Peter Lawrey Aug 26 '11 at 10:40
0

Assuming every integer occurs exactly once, you can read the file and set a bit for every number you find. The bit array has to hold 10,000,000 bits, which uses only about 1.25 MB of RAM and should easily be available... after you have read all the integers, you just go through the array and output the numbers whose bit is set...

Yahia
  • 69,653
  • 9
  • 115
  • 144
  • If you have 1 billion numbers from a set of 10 million possible values, at least one of them has to occur more than once. – Thilo Aug 26 '11 at 10:34