I have been reading a lot about this idea and have seen many different people asking questions about it. I came across http://javatroops.blogspot.ca/2012/12/sort-very-large-text-file-on-limited.html and I had some questions about it. (Sorry, I am unsure of the exact rules on linking to blogs and whatnot.)

I understand the basics of external sorting, where I break the file down into temporary files and then perform a merge sort over those temporary files. However, wouldn't this method stress the I/O because of all the files? I could limit the number of files open at one time by doing multiple passes of an n-way merge, but I wanted to find improvements.
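To check that I have the split phase straight, here is roughly what I mean, as a sketch of my own (the class and method names are mine, not taken from the blog post): read as many lines as comfortably fit in memory, sort them, write them out as a sorted temporary run, and repeat.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SplitPhase {

    // Read the big file in chunks that fit in memory, sort each chunk,
    // and write it out as a sorted temporary "run" file.
    static List<Path> splitIntoSortedRuns(Path input, int linesPerChunk) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>(linesPerChunk);
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == linesPerChunk) {
                    runs.add(writeRun(chunk));
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) {
                runs.add(writeRun(chunk));
            }
        }
        return runs;
    }

    private static Path writeRun(List<String> chunk) throws IOException {
        Collections.sort(chunk);                         // in-memory sort of this chunk
        Path run = Files.createTempFile("run-", ".txt"); // one sorted run on disk
        Files.write(run, chunk);
        return run;
    }
}
```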

So, reading the blog post above, I understand the idea behind Solution 2, and less so Solution 3, but I do not understand its implementation in Java. In both solutions, after the file has been split into temporary files, all the files are opened as BufferedReaders (a mini question about BufferedReader vs. FileReader: is it just that it adds buffering and makes I/O faster, but doesn't change usability from a programmer's perspective?) and then mapped. Does this process of mapping not take up memory? I am confused about how inserting into a TreeSet would take up memory but putting into a TreeMap wouldn't. How does Solution 3 work exactly?

  • It wouldn't 'stress the I/O because of the number of files'. The number of open files has nothing to do with it. Merging is certainly I/O-intensive, because you're hardly doing anything else except I/O, but that's true whether you are merging two files or 100. It doesn't become 'more true' with increasing N. Using buffered I/O mitigates it to a small extent, and mitigates the number of system calls to a huge extent. – user207421 Aug 10 '14 at 01:00
  • Solution 2 in the citation is a very poor attempt. He doesn't seem to have heard of replacement-selection or polyphase merging. It can be done far more efficiently than that. The implied Solution 3 with memory-mapped files is entirely impractical, as I discovered several decades ago. – user207421 Aug 10 '14 at 01:45
  • @EJP, let's say I need to sort a 2 GB file: do I benefit from using replacement selection and polyphase merging (as the entire file is present on one disk)? Is it not a good idea to use Unix sort by spawning a separate process? – saiki4116 Jul 08 '15 at 19:42
  • Here is a simple Java program that implements this: https://stackoverflow.com/a/49839773/1647320 – sapy Apr 15 '18 at 21:12

1 Answer


However, wouldn't this method stress the I/O because of all the files? I could limit the number of files open at one time by doing multiple passes of an n-way merge, but I wanted to find improvements.

Using memory is always going to be faster, at least 10x to 1000x faster. This is why we use memory and generally try to ensure we have as much of it as we need. If you are going to lean on the I/O heavily, you have to be careful about how you access the data, because the more I/O you do, the slower you will be.
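To illustrate the point about the number of files: the merge itself can be a single pass over any number of sorted runs if you keep one buffered reader per run and always take the smallest current line via a priority queue. This is only a rough sketch under my own naming (not the blog's code), assuming the runs were written one record per line:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {

    // One buffered reader per sorted run, plus the line it is currently positioned on.
    static final class Run {
        final BufferedReader reader;
        String current;

        Run(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.current = reader.readLine();
        }
    }

    // Merge all sorted runs into 'output' in a single pass.
    static void merge(List<Path> runs, Path output) throws IOException {
        PriorityQueue<Run> heap = new PriorityQueue<>(Comparator.comparing((Run r) -> r.current));
        for (Path p : runs) {
            Run r = new Run(Files.newBufferedReader(p));
            if (r.current != null) heap.add(r); else r.reader.close();
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!heap.isEmpty()) {
                Run r = heap.poll();             // run with the smallest current line
                out.write(r.current);
                out.newLine();
                r.current = r.reader.readLine(); // advance that run
                if (r.current != null) heap.add(r); else r.reader.close();
            }
        }
    }
}
```

Only one line per run is held in memory at a time, and the buffered streams keep the number of system calls down, which is the part that really matters.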

is it just that it adds buffering and makes I/O faster, but doesn't change usability from a programmer's perspective?

The main advantage is that it supports reading lines (and it adds buffering).
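For example, the usual pattern looks like this (my own illustration, with a made-up file name): the FileReader gives you characters, and the BufferedReader wrapper adds an internal buffer plus readLine().

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReadLines {
    public static void main(String[] args) throws IOException {
        // FileReader alone only offers read() on single chars or char arrays;
        // wrapping it in BufferedReader adds buffering and readLine().
        try (BufferedReader in = new BufferedReader(new FileReader("data.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```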

Memory-mapped files can result in less garbage and less overhead, depending on how they are used, but you are right that it's not very clear.

Memory-mapped files use off-heap memory and can be swapped in and out more naturally. The problem is that they use all the available free memory transparently, i.e. your heap limit doesn't apply: you can pretend you only have 40 MB of heap, but you can't pretend you don't have as much main memory as you actually do.
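If it helps, this is roughly what a memory-mapped file looks like in Java. It is just the FileChannel.map mechanism, not the blog's Solution 3, and the file name is made up:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MapExample {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get("big.txt"), StandardOpenOption.READ)) {
            // The mapping lives outside the Java heap; the OS pages it in and out as needed.
            // A single MappedByteBuffer is limited to 2 GB, so a bigger file has to be
            // mapped in several regions.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            while (buf.hasRemaining()) {
                byte b = buf.get(); // bytes are faulted in lazily as pages are touched
                // ... process b ...
            }
        }
    }
}
```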

How does Solution 3 work exactly?

I suggest you ask the author what he meant exactly.

Peter Lawrey