I need another set of eyes on this.
I've written out hundreds of gigabytes to a zip file with this exact code, with no modifications, running locally on Mac OS X.
With the code 100% unchanged, just deployed to an AWS instance running Ubuntu, the same operation runs into out-of-memory issues (heap space).
Here's the code that's being run, streaming MyBatis to a CSV file on disk:
File directory = new File(feedDirectory);
File file;
try {
    file = File.createTempFile(("feed-" + providerCode + "-"), ".csv", directory);
} catch (IOException e) {
    throw new RuntimeException("Unable to create file to write feed to disk: " + e.getMessage(), e);
}

String filePath = file.getAbsolutePath();
log.info(String.format("File name for %s feed is %s", providerCode, filePath));

// output file
try (FileOutputStream out = new FileOutputStream(file)) {
    streamData(out, providerCode, startDate, endDate);
} catch (IOException e) {
    throw new RuntimeException("Unable to write feed to file: " + e.getMessage(), e);
}
public void streamData(OutputStream outputStream, String providerCode, Date startDate, Date endDate) throws IOException {
    try (CSVPrinter printer = CsvUtil.openPrinter(outputStream)) {
        StreamingHandler<FStay> handler = stayPrintingHandler(printer);
        warehouse.doForAllStaysByProvider(providerCode, startDate, endDate, handler);
    }
}
private StreamingHandler<FStay> stayPrintingHandler(CSVPrinter printer) {
    StreamingHandler<FStay> handler = new StreamingHandler<>();
    handler.setHandler((stay) -> {
        try {
            EXPORTER.writeStay(printer, stay);
        } catch (IOException e) {
            log.error("Issue with writing output: " + e.getMessage(), e);
        }
    });
    return handler;
}
// The EXPORTER method
import org.apache.commons.csv.CSVPrinter;

public void writeStay(CSVPrinter printer, FStay stay) throws IOException {
    List<Object> list = asList(stay);
    printer.printRecord(list);
}

List<Object> asList(FStay stay) {
    List<Object> list = new ArrayList<>(46);
    list.add(stay.getUid());
    list.add(stay.getProviderCode());
    //....
    return list;
}
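For reference, the StreamingHandler used in stayPrintingHandler above is just a thin adapter over myBatis's ResultHandler that hands each row to a lambda and keeps no references of its own. A minimal sketch of that kind of adapter (simplified, not the exact class) looks like:

import java.util.function.Consumer;

import org.apache.ibatis.session.ResultContext;
import org.apache.ibatis.session.ResultHandler;

// Minimal streaming adapter: each row is forwarded to the consumer and nothing is retained here.
public class StreamingHandler<T> implements ResultHandler<T> {

    private Consumer<T> handler;

    public void setHandler(Consumer<T> handler) {
        this.handler = handler;
    }

    @Override
    public void handleResult(ResultContext<? extends T> resultContext) {
        handler.accept(resultContext.getResultObject());
    }
}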
Here's a graph of the JVM heap space (using JVisualVM) when I run this locally. I've run this consistently with Java 8 (jdk1.8.0_51 and 1.8.0_112) locally and have gotten great results; I've even written out a terabyte of data.
^ In the above, the max heap space is set to 4 gigs, and the most it ever increases to is 1.5 gigs, before going back down to around 500 MB, while streaming data to the CSV file as it's supposed to.
However, when I run this on Ubuntu with JDK 1.8.0_111, the exact same operation will not complete, running out of heap space (java.lang.OutOfMemoryError: Java heap space).
I've upped the -Xmx value from 8 gigs to 16 to 25 gigs and still run out of heap space. Meanwhile... the file is only 10 gigs in total... which really perplexes me.
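(A quick sanity check, in case the -Xmx value isn't actually reaching the JVM on the AWS instance: log what the runtime reports as its max heap. This is just a throwaway diagnostic, not part of the feed code.)

// Throwaway diagnostic: confirm the JVM actually received the -Xmx value we think it did.
log.info(String.format("Max heap reported by JVM: %d MB",
        Runtime.getRuntime().maxMemory() / (1024 * 1024)));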
Here's what the JVisualVm graph looks like on the Ubuntu box:
I've no doubt it's the exact same code running in both environments, with the same operation being performed in each (same database server providing the same data).
The only differences I can think of at this point are:
- Operating system - Ubuntu vs Mac OS X
- Hosted VM in AWS vs hard metal laptop
- Network speed is faster in AWS between database and Ubuntu server
- JDK version is 1.8.0_111 in Ubuntu, tried 1.8.0_51 and 1.8.0_112 locally
Can anyone help shed any light on this problem?
Update
I've tried replacing all the try-with-resources statements with explicit flush() and close() calls, with no luck.
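Roughly what that variant looked like (a sketch, assumed to live in a method that already throws IOException; same CsvUtil helper and handler as above):

// Sketch of the explicit flush/close variant; behavior is equivalent to the try-with-resources version.
FileOutputStream out = null;
CSVPrinter printer = null;
try {
    out = new FileOutputStream(file);
    printer = CsvUtil.openPrinter(out);
    warehouse.doForAllStaysByProvider(providerCode, startDate, endDate, stayPrintingHandler(printer));
    printer.flush();
} finally {
    if (printer != null) {
        printer.close();
    }
    if (out != null) {
        out.close();
    }
}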
What's more, I tried forcing a garbage collection on the Ubuntu box as soon as I started to see the data come in, and it had no effect: something is definitely stopping the heap from being collected on the Ubuntu machine... while running the exact same code on OS X let me write the full enchilada again with no problem.
Update 2
In addition to the differences in the environments above, the only other difference I can think of is whether the connection between the servers in AWS is so fast that the data streams in faster than it can be flushed to disk... but that still doesn't explain the issue where I only have 10 gigs of data total, and it blows up a JVM with 20 gigs of heap space.
Is there any likelihood of there being a bug at the Ubuntu/Java level for this?
Update 3
Tried replacing the CSVPrinter output with an entirely separate library (OpenCSV's CSVWriter in lieu of Apache's CSV library), and the same result occurs.
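Roughly what the OpenCSV variant looked like (a sketch, assuming OpenCSV 3.x; the FStay-to-String[] mapping is simplified here):

import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

import com.opencsv.CSVWriter;

// Sketch of the OpenCSV-based variant: same streaming structure, different CSV library.
public void streamData(OutputStream outputStream, String providerCode, Date startDate, Date endDate) throws IOException {
    try (CSVWriter writer = new CSVWriter(new OutputStreamWriter(outputStream, StandardCharsets.UTF_8))) {
        StreamingHandler<FStay> handler = new StreamingHandler<>();
        handler.setHandler((stay) -> {
            // convert each row to String[] and write it immediately; nothing is retained
            writer.writeNext(asList(stay).stream().map(String::valueOf).toArray(String[]::new));
        });
        warehouse.doForAllStaysByProvider(providerCode, startDate, endDate, handler);
    }
}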
As soon as this code starts receiving data from the database, the heap starts blowing up and the garbage collector fails to reclaim any memory... but only on Ubuntu. On OS X, everything is reclaimed immediately and the heap never grows.
I've also tried flushing the stream after every write, but had no luck with that either.
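That attempt amounted to adding a flush inside the handler, roughly:

handler.setHandler((stay) -> {
    try {
        EXPORTER.writeStay(printer, stay);
        printer.flush();   // push each record straight through to the underlying stream
    } catch (IOException e) {
        log.error("Issue with writing output: " + e.getMessage(), e);
    }
});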
Update 4
Got the heap dump to print out, and according to it I should be looking at the database driver: specifically the InboundDataHandler in Amazon's Redshift driver.
I'm using myBatis with a custom result handler. I tried setting the result handler to effectively do nothing when it gets a result (new ResultHandler<>() { // method overridden to do literally nothing}) and I know I'm not holding on to any references there.
Since it's the InboundDataHandler defined by AWS/Redshift... it makes me think it may be lower than the myBatis level... either:
- Error in the SqlSessionFactory I'm setting up
- Bug in the Redshift driver that only pops up in Ubuntu / AWS
- Bug in the result handler I have overwritten
Here's the heap dump screenshot:
Here's where I'm setting up my SqlSessionFactoryBean:
@Bean
public javax.sql.DataSource redshiftDataSource() throws ClassNotFoundException {
    log.info("Got to datasource config");
    // Dynamically load driver at runtime.
    Class.forName(dataWarehouseDriver);
    DataSource dataSource = new DataSource();
    dataSource.setURL(dataWarehouseUrl);
    dataSource.setUserID(dataWarehouseUsername);
    dataSource.setPassword(dataWarehousePassword);
    return dataSource;
}
@Bean
public SqlSessionFactoryBean sqlSessionFactory() throws ClassNotFoundException {
    SqlSessionFactoryBean factoryBean = new SqlSessionFactoryBean();
    factoryBean.setDataSource(redshiftDataSource());
    return factoryBean;
}
Here's the myBatis code I'm running as a test to verify that it's not me holding on to records in my ResultHandler:
warehouse.doForAllStaysByProvider(providerCode, startDate, endDate, new ResultHandler<FStay>() {
    @Override
    public void handleResult(ResultContext<? extends FStay> resultContext) {
        // do nothing
    }
});
Is there a way I can force the SQL connection not to hang on to records or something? I'll reiterate that on my local machine there is no issue with this memory leak... it only surfaces when running the code in the hosted AWS environment. And in both cases, the database driver and server are the same.
Update 6
I think it's finally fixed. Thanks to all who pointed me in the direction of the heap dump. That helped narrow it down to the offending class in a huge way.
After that, I did some research on the AWS Redshift driver, and it explicitly says that clients should specify a limit for any operations on large data sets. So I found out how to do that in my myBatis configuration:
<select id="doForAllStaysByProvider" fetchSize="1000" resultMap="FStayResultMap">
    select distinct
        f_stay.uid,
And this did the trick.
Mind you, this wasn't necessary even when handling much larger data sets downloaded remotely from AWS (database in AWS, code executing on a laptop at home), and it shouldn't be necessary at all, since I'm overriding the myBatis ResultHandler<>, which handles each row individually and never holds on to any objects.
Yet something funky happens with the AWS Redshift JDBC driver only when it's run in AWS (database in AWS, code executing on an AWS instance) that causes this InboundDataHandler to never release its resources unless a fetchSize is specified.
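For anyone configuring myBatis programmatically rather than per-statement in the XML, I believe the same thing can be applied globally as a default fetch size on the Configuration (an untested sketch, assuming MyBatis 3.3+ for setDefaultFetchSize and mybatis-spring 1.3+ for setConfiguration):

// Untested sketch: set a global default fetch size instead of per <select> statement.
@Bean
public SqlSessionFactoryBean sqlSessionFactory() throws ClassNotFoundException {
    SqlSessionFactoryBean factoryBean = new SqlSessionFactoryBean();
    factoryBean.setDataSource(redshiftDataSource());

    org.apache.ibatis.session.Configuration configuration = new org.apache.ibatis.session.Configuration();
    configuration.setDefaultFetchSize(1000);   // statements fetch in batches of 1000 unless they override this
    factoryBean.setConfiguration(configuration);

    return factoryBean;
}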
Here's the heap of the server running now, getting much further than it ever has before in AWS, with the heap space never moving above 500 MB; after I hit 'Force GC' in JVisualVM, it shows the used heap at less than 100 MB:
Thanks again in a huge way to all those who helped guide this!