So I'm using Java to do multi-way external merge sorts of large on-disk files of line-delimited tuples. Batches of tuples are read into a TreeSet, and each full batch is then dumped to disk as a sorted file. Once all of the input has been read, these sorted batches are merged to the output.
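For context, the process is roughly the following. This is a simplified sketch rather than my actual code: the class name `ExternalSortSketch`, the fixed `batchSize` parameter, and the use of temp files are all illustrative assumptions. Since a TreeSet drops duplicates within a batch, the sketch also assumes duplicates should be dropped across batches during the merge.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

final class ExternalSortSketch {

    /** Sorts the lines of input into output, holding at most batchSize tuples in memory. */
    public static void sort(Path input, Path output, int batchSize) throws IOException {
        List<Path> batches = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            TreeSet<String> batch = new TreeSet<>();
            String line;
            while ((line = in.readLine()) != null) {
                batch.add(line);
                // batchSize is the "magic number": how many tuples we believe fit in the heap.
                if (batch.size() >= batchSize) {
                    batches.add(dump(batch));
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                batches.add(dump(batch));
            }
        }
        merge(batches, output);
    }

    /** Writes one in-memory batch to a temp file; TreeSet iteration is already sorted. */
    private static Path dump(TreeSet<String> batch) throws IOException {
        Path file = Files.createTempFile("batch", ".txt");
        Files.write(file, batch);
        return file;
    }

    /** Multi-way merge of the sorted batch files using a min-heap of per-file cursors. */
    private static void merge(List<Path> batches, Path output) throws IOException {
        PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) -> a.line.compareTo(b.line));
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (Path p : batches) {
                Cursor c = new Cursor(Files.newBufferedReader(p));
                if (c.advance()) {
                    heap.add(c);
                }
            }
            String last = null;
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();
                // Assumption: skip tuples already written, matching the TreeSet's set semantics.
                if (!c.line.equals(last)) {
                    out.write(c.line);
                    out.newLine();
                    last = c.line;
                }
                if (c.advance()) {
                    heap.add(c);
                }
            }
        }
        for (Path p : batches) {
            Files.deleteIfExists(p);
        }
    }

    /** Tracks the current line of one sorted batch file. */
    private static final class Cursor {
        final BufferedReader reader;
        String line;

        Cursor(BufferedReader reader) {
            this.reader = reader;
        }

        /** Reads the next line; closes the reader and returns false at end of file. */
        boolean advance() throws IOException {
            line = reader.readLine();
            if (line == null) {
                reader.close();
                return false;
            }
            return true;
        }
    }
}
```

The whole question is about how to choose `batchSize` so that each TreeSet is as large as the heap safely allows.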
Currently I'm using magic numbers to work out how many tuples fit into memory. The estimate is based on a static figure for roughly how many tuples fit per MB of heap space, combined with how much heap space is available, computed using:
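long max = Runtime.getRuntime().maxMemory();     // upper limit the heap may grow to
long total = Runtime.getRuntime().totalMemory(); // heap currently reserved by the JVM
long free = Runtime.getRuntime().freeMemory();   // unused part of the reserved heap
long space = free + (max - total);               // unused now, plus room left to grow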
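Put together, the batch size is derived roughly like this. Again a simplified sketch: `HeapBudgetSketch`, `TUPLES_PER_MB` and its value of 5,000 are placeholders for illustration, not my real figure.

```java
final class HeapBudgetSketch {

    // Hypothetical static figure: how many tuples are assumed to fit per MB of heap.
    static final long TUPLES_PER_MB = 5_000;

    /** Estimates the heap still available: unused reserved heap plus room left to grow. */
    static long availableHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();
        long total = rt.totalMemory();
        long free = rt.freeMemory();
        return free + (max - total);
    }

    /** Converts an available-bytes estimate into a tuple count using the static figure. */
    static long batchCapacity(long availableBytes, long tuplesPerMb) {
        long availableMb = availableBytes / (1024L * 1024L);
        return availableMb * tuplesPerMb;
    }

    public static void main(String[] args) {
        long capacity = batchCapacity(availableHeapBytes(), TUPLES_PER_MB);
        System.out.println("Tuples per batch: " + capacity);
    }
}
```

The weak point is `TUPLES_PER_MB`: it is fixed up front, so it cannot adapt to the actual size of the tuples being sorted.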
However, this does not always work well: we may be sorting tuples of different lengths (for which the static tuples-per-MB figure might be too conservative), and I now want to use flyweight patterns to pack more tuples in, which may make the figure even more variable.
So I'm looking for a better way to fill the heap space to the brim. Ideally the solution should be:
- reliable (no risk of an OutOfMemoryError from exhausting the heap space)
- flexible (not based on static numbers)
- efficient (e.g., not polling runtime memory estimates after every tuple)
Any ideas?