-2

How do I sort a very large file (around 6 GB, containing some 10 million JSON records) by key?

The solution should be memory-optimised. There are ways to load the data into a Collection and sort it, but that consumes a lot of heap space and becomes very slow.

Please suggest a generic, memory-optimised sorting technique to which we can pass a JSON file, the key(s) to sort on, and the sort type, and which returns a sorted file.

For example

File input.json

{
    "name":"rohit", "age":20, ....
}
{
    "name":"sourav", "age":32, ....
}
.
.
.
//some 10 million records

So, if the key is age and the type is desc, it should return a file sorted on age in descending order.

Rohit Mishra
  • 2
    Sounds like you need a DB... ;) – Nir Alfasi Feb 14 '18 at 17:08
  • 3
    What have you tried so far? Maybe try to figure it out for yourself first and then come back when you run into problems :) – RAZ_Muh_Taz Feb 14 '18 at 17:10
  • You might be lucky enough to keep the parsed data of the whole file in RAM (4 GB, 8 GB at most, should be enough). – MrSmith42 Feb 14 '18 at 17:13
  • Look into Hadoop MapReduce, it may help. There are some examples at this link: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html – nikunjM Feb 14 '18 at 17:14
  • 1
    What about [sorting on disk](http://www.codeodor.com/index.cfm/2007/5/10/Sorting-really-BIG-files/1194)? – SurfMan Feb 14 '18 at 17:14

2 Answers

1

You could try an external merge sort, i.e. sorting smaller chunks of the file one at a time, storing each sorted chunk on disk, and then merging the chunks.
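
A minimal sketch of that idea in Python, under assumptions that are not in the question: the input holds one JSON record per line (the pretty-printed sample would need to be flattened first), every record contains the sort key, and `chunk_size` records fit comfortably in memory. The function name `external_sort` and its parameters are hypothetical.

```python
import heapq
import json
import os
import tempfile
from itertools import islice


def _read_run(path):
    # Lazily yield the records of one sorted run, one line at a time.
    with open(path, encoding="utf-8") as run:
        for line in run:
            yield json.loads(line)


def external_sort(src, dst, key, descending=False, chunk_size=100_000):
    """Sort a one-record-per-line JSON file by `key` with bounded memory (assumed format)."""
    run_paths = []
    try:
        # Phase 1: read a chunk that fits in memory, sort it, spill it to a temp file.
        with open(src, encoding="utf-8") as fin:
            while True:
                lines = list(islice(fin, chunk_size))
                if not lines:
                    break
                records = [json.loads(ln) for ln in lines if ln.strip()]
                records.sort(key=lambda r: r[key], reverse=descending)
                fd, path = tempfile.mkstemp(suffix=".run")
                run_paths.append(path)
                with os.fdopen(fd, "w", encoding="utf-8") as run:
                    for rec in records:
                        run.write(json.dumps(rec) + "\n")
        # Phase 2: k-way merge of the sorted runs; heapq.merge pulls records
        # lazily, so only about one record per run is in memory at a time.
        runs = [_read_run(p) for p in run_paths]
        with open(dst, "w", encoding="utf-8") as fout:
            merged = heapq.merge(*runs, key=lambda r: r[key], reverse=descending)
            for rec in merged:
                fout.write(json.dumps(rec) + "\n")
    finally:
        # Remove the temporary run files whether or not the sort succeeded.
        for p in run_paths:
            os.remove(p)
```

For the example in the question, a call might look like `external_sort("input.json", "sorted.json", "age", descending=True)`. Memory use is governed by `chunk_size`, and the number of run files open during the merge is roughly the record count divided by `chunk_size`, so very small chunks could hit the operating system's open-file limit.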

tsingh
1

Your requirement is not so simple: the file is 6 GB, and on top of that it has to be sorted. First, split the file into smaller files. Then decide on a proper sorting algorithm or procedure. Read each small file, sort it according to that algorithm, and write the result to a file. Each newly created file should hold the records for only one specific key value. For example, if the key is green, write those records to a green.sort file. Finally, merge all the files into one.
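
One possible reading of this answer is a distribution (bucket) sort: route records into per-key-range files, sort each file on its own, then concatenate the files in key order. The sketch below is an interpretation, not the answerer's code. It assumes one JSON record per line, a numeric key, and caller-supplied split points; `bucket_sort_file` and `boundaries` are hypothetical names.

```python
import bisect
import json
import os
import tempfile


def bucket_sort_file(src, dst, key, boundaries, descending=False):
    """Bucket sort of a one-record-per-line JSON file (assumed format).

    `boundaries` is an ascending list of split points, e.g. [18, 40]
    produces three buckets: key < 18, 18 <= key < 40, key >= 40.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = [os.path.join(tmpdir, "bucket%d.sort" % i)
                 for i in range(len(boundaries) + 1)]
        buckets = [open(p, "w", encoding="utf-8") for p in paths]
        try:
            # Pass 1: stream the input and route each record to its bucket file.
            with open(src, encoding="utf-8") as fin:
                for line in fin:
                    if line.strip():
                        rec = json.loads(line)
                        idx = bisect.bisect_right(boundaries, rec[key])
                        buckets[idx].write(json.dumps(rec) + "\n")
        finally:
            for b in buckets:
                b.close()
        # Pass 2: sort one bucket at a time in memory and append it to the
        # output; buckets are visited in key order, so no merge step is needed.
        order = reversed(paths) if descending else paths
        with open(dst, "w", encoding="utf-8") as fout:
            for p in order:
                with open(p, encoding="utf-8") as f:
                    records = [json.loads(ln) for ln in f]
                records.sort(key=lambda r: r[key], reverse=descending)
                fout.writelines(json.dumps(r) + "\n" for r in records)
```

This only stays within memory if the split points divide the data fairly evenly, since each bucket is sorted in memory as a whole. When the key distribution is unknown, the chunk-and-merge approach is usually the safer choice.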

Abhijit Pritam Dutta