0

I am looking for some ideas idea on sorting large amount of strings from an input file and print out the sorted results to a new file in Java. The requirement is that the input file could be extremely large. I need to consider the performance in the solution, so any ideas?

NPE
  • 486,780
  • 108
  • 951
  • 1,012
Ruper
  • 125
  • 1
  • 4
  • 9
  • possible duplicate of [How do I sort very large files](http://stackoverflow.com/questions/7918060/how-do-i-sort-very-large-files) – Matthew Farwell Feb 03 '12 at 16:34
  • This is a [very](http://stackoverflow.com/questions/2087469/sort-a-file-with-huge-volume-of-data-given-memory-constraint) [common](http://stackoverflow.com/questions/8832822/sorting-lines-of-an-enormous-file-txt-in-java) [question](http://stackoverflow.com/questions/7918060/how-do-i-sort-very-large-files). – Dmitri Feb 03 '12 at 16:37
  • Please define "extremely large" - what order of magnitude? GB? TB? More? – DNA Feb 03 '12 at 17:01

3 Answers3

3

External Sorting technique is generally used to sort huge amounts of data. May be this is what you need.

externalsortinginjava is the java library for this.

Aravind Yarram
  • 78,777
  • 46
  • 231
  • 327
1

Is an SQL database available? If you inserted all the data into a table, with the sortable column or section indexed, you may (or may not) be able to output the sorted result more efficiently. This solution may also be helpful if the amount of data, outweighs the amount of RAM available.

It would be interesting to know how large, and what the purpose is.

700 Software
  • 85,281
  • 83
  • 234
  • 341
  • Seems like a lot of overhead just to sort something. What is the benefit of SQL here, over any simple B-tree implementation? – Dmitri Feb 03 '12 at 16:40
1

Break the file into amounts you can read in memory. Sort each amount and write to a file. (If you could fit everything into memory you are done) Merge sort the resulting files into a single sorted file.

You can also do a form of radix sort to improve CPU efficiency, but the main bottleneck is all the re-writing and re-reading you have to do.

Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130