1

I have 50 text files of around 10 gb of numbers. I have to sort these numbers. My first idea is to use apply Merge Sort i.e. Sort each file separately and merge them. I am using array to load these numbers. and when I run the application my program crashes due lack of memory. So My Question is:

  1. Which data structure to use?
  2. How to manage the memory?
  3. Is merge sort correct approach? if not please suggest the way.

Any help will be appreciated.

Vivek Mishra
  • 1,772
  • 1
  • 17
  • 37
  • This may help: https://stackoverflow.com/questions/4358087/sort-with-the-limited-memory – Cong Ma Apr 28 '17 at 07:40
  • What is the minimum and maximum value of the numbers? – samgak Apr 28 '17 at 07:42
  • yeah infact I created smaller files and sorted them. but I am loading the files to merge them then I am facing the problem. – Vivek Mishra Apr 28 '17 at 07:43
  • @samgak I didn't see the maximum values but numbers go up to 7 digits. and minimum is the 2 digit number but may be single digit also I didn't check. – Vivek Mishra Apr 28 '17 at 07:43
  • 2
    You don't load all the files into memory in order to merge them. Instead, you load one item from each file, find the smallest item, and output it. Then you load one item from the file that one came from and do it all over again. This is best done with a priority queue. See http://stackoverflow.com/q/20802396/56778. Or if you have access to the [Gnu sort utility](http://man7.org/linux/man-pages/man1/sort.1.html), you can have it sort and merge the files. – Jim Mischel Apr 28 '17 at 14:27

1 Answers1

1

If the numbers only go up to 7 digits and are integers, then you can use a Counting Sort.

You will only need around 40Mb of memory, to store 10 million 4 byte integers giving the count how many there are of each of the numbers from 0-9,999,999. If you have to deal with the possibility of more than 2.14 billion duplicates of the same number then you can use 64-bit ints. Initialize the array to zero then read the numbers in one at a time, updating the count for each. Then once you are finished you can generate the sorted list based on the counts.

samgak
  • 23,944
  • 4
  • 60
  • 82
  • Ok Counting sort! I'll check that. but I dont think there are only 10 million numbers as I have 10 gb of files so many numbers will be repeated many times I think! – Vivek Mishra Apr 28 '17 at 07:49
  • 1
    If they are only 7 digits max then there can't be more than 10 million unique numbers. – samgak Apr 28 '17 at 07:51
  • 1 up for this information. I'll check and get back to you. looks like its gonna solve my problem. – Vivek Mishra Apr 28 '17 at 07:53
  • (You may need to at least check for count overflow: with 2+1 "text bytes" per short digit string, 5*10^11 amounts to more numbers than four bytes can count.) – greybeard Apr 28 '17 at 16:47