3

So I'm using Java to do multi-way external merge sorts of large on-disk files of line-delimited tuples. Batches of tuples are read into a TreeSet, which is then dumped to disk as a sorted batch. Once all of the data have been exhausted, these batches are merge-sorted to the output.
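
To make the setup concrete, here's a rough sketch of what the merge phase looks like (simplified: it assumes plain String tuples with their natural ordering and illustrative file handling; the real code uses custom comparators):

import java.io.*;
import java.util.*;

// Rough sketch of the merge phase only: each sorted batch file is read line by
// line, and a PriorityQueue always yields the smallest current head line.
public class MergeBatches {

    // one open batch file plus its current head line
    static final class Head {
        final BufferedReader reader;
        String line;
        Head(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.line = reader.readLine();
        }
    }

    static void merge(List<File> batches, File output) throws IOException {
        PriorityQueue<Head> heap = new PriorityQueue<Head>(
                Math.max(1, batches.size()), new Comparator<Head>() {
                    public int compare(Head a, Head b) {
                        return a.line.compareTo(b.line); // natural ordering assumed
                    }
                });
        for (File batch : batches) {
            Head h = new Head(new BufferedReader(new FileReader(batch)));
            if (h.line != null) heap.add(h); else h.reader.close();
        }
        BufferedWriter out = new BufferedWriter(new FileWriter(output));
        try {
            while (!heap.isEmpty()) {
                Head h = heap.poll();          // batch with the smallest head line
                out.write(h.line);
                out.newLine();
                h.line = h.reader.readLine();  // advance that batch
                if (h.line != null) heap.add(h); else h.reader.close();
            }
        } finally {
            out.close();
        }
    }
}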

Currently I'm using magic numbers to figure out how many tuples we can fit into memory. This is based on a static figure indicating roughly how many tuples fit per MB of heap space, and on how much heap space is available, computed using:

long max = Runtime.getRuntime().maxMemory();
long used = Runtime.getRuntime().totalMemory();
long free = Runtime.getRuntime().freeMemory();
long space = free + (max - used); // free now, plus the room the heap can still grow by
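
For illustration, the batch sizing then looks roughly like the following sketch (TUPLES_PER_MB and SAFETY are placeholder names and values, not my actual figures):

// Illustrative sketch of the current magic-number sizing; TUPLES_PER_MB and
// SAFETY are placeholder values, not the real figures.
class BatchSizer {
    static final long TUPLES_PER_MB = 5000; // assumed static tuples-per-MB figure
    static final double SAFETY = 0.8;       // headroom so a batch never quite fills the heap

    static long tupleBudget() {
        Runtime rt = Runtime.getRuntime();
        long space = rt.freeMemory() + (rt.maxMemory() - rt.totalMemory());
        long mb = space / (1024 * 1024);
        return (long) (mb * TUPLES_PER_MB * SAFETY); // tuples to read into the next batch
    }
}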

However, this does not always work so well, since we may be sorting tuples of different lengths (for which the static tuples-per-MB figure might be too conservative), and I now want to use flyweight patterns to jam more in there, which may make the figure even more variable.

So I'm looking for a better way to fill the heap-space to the brim. Ideally the solution should be:

  • reliable (no risk of heap-space exceptions)
  • flexible (not based on static numbers)
  • efficient (e.g., not polling runtime memory estimates after every tuple)

Any ideas?

badroit
  • Add more memory, and just do it all in memory? Take your salary, divide it by ~350, and if the number of days you're spending optimizing is > the cost of a server with 100+GB of memory, you're doing it wrong. http://www.codinghorror.com/blog/2008/12/hardware-is-cheap-programmers-are-expensive.html – Stefan Kendall Jun 26 '11 at 15:54
  • No point telling me: tell that to the person who holds the equipment budget :P – badroit Jun 26 '11 at 16:05
  • While "use more memory" is hardly the best approach for large data files (while I've worked with 2tb RAM clusters, I can imagine many of those systems being used to sort large data sets), why not use one of the existing external sort algorithms? Also I don't see how exactly this should work if we have to add tuples of unknown size and stop at the right time (without trying and failing that is). Also if we sort the whole data why care about having tuples together anyhow? Just read Xmb of data and be good? – Voo Jun 26 '11 at 16:35
  • @Voo, I've had a look at some of the Java external sorting code packages out there, but haven't found a good one; the ones I find through Google are quite naive. In terms of unknown size, the tuples are tiny (~200 chars) compared to memory, so it should be feasible; thinking now about just checking memory every 1k tuples or something, but I'm sure there's a cleaner solution, maybe involving weak references or such. With regard to sorting tuples, custom comparators can be specified. – badroit Jun 26 '11 at 17:43
  • In terms of a typical scenario, I'm looking at sorting sometimes up to 2 billion tuples (with about 200 chars each) in 3GB heap-space. I can do this currently, but the number of batches to merge kills performance... hence I'm looking to jam as much into memory as possible to reduce batch count. – badroit Jun 26 '11 at 17:49
  • So, what's wrong with an [external sort](http://code.google.com/p/externalsortinginjava/), like an [external merge sort](http://www.codeodor.com/index.cfm/2007/5/14/Re-Sorting-really-BIG-files---the-Java-source-code/1208)? Is the additional size an issue? – DarkDust Jun 26 '11 at 18:05
  • @DarkDust, just to clarify, I'm doing an external sort. Part of doing that efficiently is minimising batches by maximising heap-space use. Along those lines, my question is looking for a graceful way to fill heap-space to the brim with tuples parsed from a file. – badroit Jun 26 '11 at 18:17
  • No real suggestions, but I recall solving this (nearly) exact same problem for the S/38 OS ca 1982. "Mini indexes" we called it, and we were using page-balanced binary radix trees. IIRC, more key to performance than sizing the initial sort was managing the merge phase -- sometimes multiple passes were required. Do be aware that in Java the heap size values are only really reliable after a GC. (I was thinking your formula was wrong, but I guess it's more or less correct, given the inaccuracy of the whole thing.) – Hot Licks Jun 26 '11 at 20:20
  • @Daniel, yep I have a GC call in there before, cheers. We're not actually doing multiple passes at the moment, though it would be worth looking into I guess. Merging >1000 batches in one go is probably not the best idea. – badroit Jun 26 '11 at 20:55

4 Answers

2

Why bother with calculating how many items you can hold? How about letting Java tell you when you've used up all your memory, catching the error, and continuing? For example,

    // prepare the output medium now so we don't need to worry about having enough
    // memory once the treeset has been filled.
    BufferedWriter writer = new BufferedWriter(new FileWriter("output"));

    // String is assumed as the element type here; use your tuple type and
    // comparator as appropriate
    Set<String> set = new TreeSet<String>();
    int linesRead = 0;
    {
        BufferedReader reader = new BufferedReader(new FileReader("input"));
        try {
            String line = reader.readLine();
            while (line != null) {
                set.add(parseTuple(line));
                linesRead += 1;
                line = reader.readLine();
            }
            // end of file reached
            linesRead = -1;
        } catch (OutOfMemoryError e) {
            // while loop broken; the treeset holds as many tuples as would fit
        } finally {
            reader.close();
        }
        // since reader and line were declared in a block their resources will
        // now be released
    }

    // output treeset to file, one tuple per line
    for (String o : set) {
        writer.write(o);
        writer.newLine();
    }
    writer.close();

    // use linesRead to find position in file for next pass
    // or continue on to next file, depending on value of linesRead

If you still have trouble with memory, just make the reader's buffer extra large so as to reserve more memory.

The default buffer size for a BufferedReader is 4096 bytes. So when you finish reading you will release upwards of 4k of memory. After this, your additional memory needs will be minimal. You need enough memory to create an iterator for the set; let's be generous and assume 200 bytes. You will also need memory to store the string output of your tuples (but only temporarily). You say the tuples contain about 200 characters; let's double that to account for separators -- 400 characters, which is 800 bytes. So all you really need is an additional 1k bytes, and you're fine, as you've just released 4k bytes.

The reason you don't need to worry about the memory used to store the string output of your tuples is that they are short-lived and only referenced within the output for loop. Note that the Writer will copy the contents into its buffer and then discard the string. Thus, the next time the garbage collector runs, the memory can be reclaimed.

I've checked, and an OOME in add will not leave a TreeSet in an inconsistent state: the memory allocation for a new Entry (the internal implementation for storing a key/value pair) happens before the internal representation is modified.

Dunes
  • Thanks for the response! Had half considered this, but recovering and continuing from an OOME is not a good idea: http://stackoverflow.com/questions/2679330/catching-java-lang-outofmemoryerror – badroit Jun 26 '11 at 18:18
  • It's generally a bad idea to catch an OOME, but you have a specific use case here. Plus, the above code has been engineered to release a significant portion of memory upon the error being thrown. Catching an OOME is generally a bad idea because additional resources may be required to handle the error, thus causing the error to be thrown again. And like I said earlier, this is not your situation. – Dunes Jun 26 '11 at 18:44
  • There is another reason catching an OOME is generally a bad idea: the sudden termination of method execution due to an OOME may have left an object in an inconsistent state. How do you know that isn't the case for a TreeSet? Moreover, garbage collection in Java is not implemented by reference counting, and since modern garbage collectors compact the heap, contiguity is not an issue. – meriton Jun 26 '11 at 19:13
  • @meriton I had a little check. There's only one place where new is called during a call to TreeSet.add and it is before any modification is performed on the internal state. Everything else is either managed on the stack or in the fields of objects that already exist on the heap. I had forgotten that the Java GC is far more efficient than the C GC though, so your point about reference counting and contiguity stands. However, the point I was trying to make is that memory would have been made available before additional heap space was required. – Dunes Jun 26 '11 at 19:43
  • Ok, that solution should work then. I'm still slightly concerned that the garbage collector will spin for a while before admitting defeat and throwing the OOME, but perhaps I overestimate that problem. In any case: +1. – meriton Jun 26 '11 at 19:53
  • Perhaps, but having looked at your answer I think the problem might be with the limited amount of heap space and multiple GC runs due to the creation of so many strings. Though 1.6 has Escape Analysis, which means the JVM might decide to allocate the strings on the stack rather than the heap. – Dunes Jun 26 '11 at 20:51
2

Filling the heap to the brim might be a bad idea due to garbage collector thrashing. (As the heap gets nearly full, the efficiency of garbage collection approaches 0, because the effort for a collection depends on heap size, but the amount of memory freed depends on the size of the objects identified as unreachable.)

However, if you must, can't you simply do it as follows?

for (;;) {
    long freeSpace = getFreeSpace();
    if (freeSpace < 1000000) break;
    // worst-case accounting: assume every record is MAX_RECORD_SIZE, so the VM
    // doesn't need to be queried again until this estimate has been used up
    while (freeSpace > 0) {
        treeSet.add(readRecord());
        freeSpace -= MAX_RECORD_SIZE;
    }
}

The calls to discover the free memory will be rare, so they shouldn't tax performance much. For instance, if you have 1 GB of heap space, leave 1 MB empty, and MAX_RECORD_SIZE is ten times the average record size, getFreeSpace() will be invoked a mere log(1000) / -log(0.9) ~= 66 times.
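
getFreeSpace() here could simply reuse the formula from the question, for example:

// one possible getFreeSpace(): free memory in the current heap plus the room
// the heap is still allowed to grow by (the same formula as in the question)
static long getFreeSpace() {
    Runtime rt = Runtime.getRuntime();
    return rt.freeMemory() + (rt.maxMemory() - rt.totalMemory());
}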

meriton
  • I think this is probably the best answer, due to the consideration of how often the garbage collector will need to run. I do find your coding style odd though. Why not just have a single while loop eg. `while (freeSpace > 1000000) { read another record }` – Dunes Jun 26 '11 at 20:59
  • Your answer gets the accept for being the least scary! It was one I was leaning towards myself... – badroit Jun 26 '11 at 21:03
  • ...perhaps one question that springs to mind on second thought: is the `getFreeSpace()` accurate enough to keep this solution stable? – badroit Jun 26 '11 at 21:11
  • @Dunes: The question asks for "not polling runtime memory estimates after every tuple". I satisfy that by having an inner loop that operates on a worst-case assessment of how much free memory is left that doesn't require querying the VM. Granted, this is probably a micro-optimization that is not worth the added code complexity, but the question did call for it ... – meriton Jun 26 '11 at 21:23
  • @badroit: That would depend on how you implement it ;-) Seriously, `Runtime.freeMemory()` doesn't say how accurate the returned value is, so you should test on the target VM to be safe. (Note that you can always tune the algorithm to leave more memory to spare to compensate for any reporting inaccuracy.) – meriton Jun 26 '11 at 21:32
  • Just to mention that accuracy is not a problem, but the rate of load tends to slow dramatically towards the end. One option is to increase the min-space. However, I've implemented a check to break the loop when the rate of load deteriorates significantly, which works well as an independent check of how full the heap is getting. – badroit Jun 27 '11 at 13:42
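
A rough sketch of that rate-of-load check (the block size and the slowdown factor are illustrative, not taken from the comment above):

import java.util.Iterator;
import java.util.TreeSet;

// Stop filling when a block of inserts takes much longer than the previous
// block, a sign the GC is struggling as the heap fills up.
class RateOfLoadFill {
    static final int BLOCK = 10000; // tuples per measurement block (illustrative)

    static <T extends Comparable<T>> void fill(TreeSet<T> treeSet, Iterator<T> iter) {
        long lastBlockNanos = -1;
        int inBlock = 0;
        long blockStart = System.nanoTime();
        while (iter.hasNext()) {
            treeSet.add(iter.next());
            if (++inBlock == BLOCK) {
                long elapsed = System.nanoTime() - blockStart;
                if (lastBlockNanos > 0 && elapsed > 5 * lastBlockNanos) {
                    return; // load rate has deteriorated; caller should dump this batch
                }
                lastBlockNanos = elapsed;
                inBlock = 0;
                blockStart = System.nanoTime();
            }
        }
    }
}
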
1

You can really fill the heap to the brim using direct memory writing (it does exist in Java!). It's in sun.misc.Unsafe, but isn't really recommended for use. See here for more details. I'd probably advise writing some JNI code instead, and using existing C++ algorithms.
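
For illustration, a minimal sketch of off-heap allocation through sun.misc.Unsafe (the reflection trick on the theUnsafe field is the usual way to obtain an instance; note this memory lives outside the heap and must be freed manually):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OffHeapSketch {
    public static void main(String[] args) throws Exception {
        // Unsafe has no public constructor; grab the singleton via reflection
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long size = 8 * 1024;                       // bytes to allocate off-heap
        long address = unsafe.allocateMemory(size); // not counted against -Xmx
        try {
            unsafe.putByte(address, (byte) 42);     // raw write
            System.out.println(unsafe.getByte(address));
        } finally {
            unsafe.freeMemory(address);             // no GC here: free it yourself
        }
    }
}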

Chris Dennett
0

I'll add this as an idea I was playing around with, involving using a SoftReference as a "sniffer" for low memory.

SoftReference<byte[]> sniffer = new SoftReference<byte[]>(new byte[8192]);
while(iter.hasNext()){
   tuple = iter.next();
   treeset.add(tuple);
   if(sniffer.get()==null){
      // the 8 KB canary has been collected: memory is getting tight
      dump(treeset);
      treeset.clear();
      sniffer = new SoftReference<byte[]>(new byte[8192]);
   }
}

This might work well in theory, but I don't know the exact behaviour of SoftReference. The Javadoc says:

All soft references to softly-reachable objects are guaranteed to have been cleared before the virtual machine throws an OutOfMemoryError. Otherwise no constraints are placed upon the time at which a soft reference will be cleared or the order in which a set of such references to different objects will be cleared. Virtual machine implementations are, however, encouraged to bias against clearing recently-created or recently-used soft references.

I would like to hear feedback, as it seems to me like an elegant solution, although behaviour might vary between VMs.

Testing on my laptop, I found that the soft reference is cleared infrequently, but is sometimes cleared too early, so I'm thinking of combining it with meriton's answer:

SoftReference<byte[]> sniffer = new SoftReference<byte[]>(new byte[8192]);
while(iter.hasNext()){
   tuple = iter.next();
   treeset.add(tuple);
   if(sniffer.get()==null){
      // canary collected: double-check with a runtime free-memory estimate
      free = MemoryManager.estimateFreeSpace();
      if(free < MIN_SAFE_MEMORY){
         dump(treeset);
         treeset.clear();
         sniffer = new SoftReference<byte[]>(new byte[8192]);
      }
   }
}

Again, thoughts welcome!

badroit