I need another set of eyes on this.
I've written out hundreds of gigabytes to a zip file with this exact code, with no modifications, running locally on Mac OS X.
With the code 100% unchanged, just deployed to an AWS instance running Ubuntu, the same operation runs into out-of-memory issues (heap space).
Here's the code that's being run, streaming MyBatis to a CSV file on disk:
File directory = new File(feedDirectory);
File file;
try {
    file = File.createTempFile(("feed-" + providerCode + "-"), ".csv", directory);
} catch (IOException e) {
    throw new RuntimeException("Unable to create file to write feed to disk: " + e.getMessage(), e);
}

String filePath = file.getAbsolutePath();
log.info(String.format("File name for %s feed is %s", providerCode, filePath));

// output file
try (FileOutputStream out = new FileOutputStream(file)) {
    streamData(out, providerCode, startDate, endDate);
} catch (IOException e) {
    throw new RuntimeException("Unable to write feed to file: " + e.getMessage(), e);
}
public void streamData(OutputStream outputStream, String providerCode, Date startDate, Date endDate) throws IOException {
    try (CSVPrinter printer = CsvUtil.openPrinter(outputStream)) {
        StreamingHandler<FStay> handler = stayPrintingHandler(printer);
        warehouse.doForAllStaysByProvider(providerCode, startDate, endDate, handler);
    }
}
private StreamingHandler<FStay> stayPrintingHandler(CSVPrinter printer) {
    StreamingHandler<FStay> handler = new StreamingHandler<>();
    handler.setHandler((stay) -> {
        try {
            EXPORTER.writeStay(printer, stay);
        } catch (IOException e) {
            log.error("Issue with writing output: " + e.getMessage(), e);
        }
    });
    return handler;
}
// The EXPORTER method
import org.apache.commons.csv.CSVPrinter;

public void writeStay(CSVPrinter printer, FStay stay) throws IOException {
    List<Object> list = asList(stay);
    printer.printRecord(list);
}

List<Object> asList(FStay stay) {
    List<Object> list = new ArrayList<>(46);
    list.add(stay.getUid());
    list.add(stay.getProviderCode());
    //....
    return list;
}
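For reference, the StreamingHandler used in stayPrintingHandler above is just a thin adapter over myBatis's ResultHandler that hands each row to a lambda and keeps no references of its own. A minimal sketch of that kind of adapter (simplified, not the exact class) looks like:

import java.util.function.Consumer;

import org.apache.ibatis.session.ResultContext;
import org.apache.ibatis.session.ResultHandler;

// Minimal streaming adapter: each row is forwarded to the consumer and nothing is retained here.
public class StreamingHandler<T> implements ResultHandler<T> {

    private Consumer<T> handler;

    public void setHandler(Consumer<T> handler) {
        this.handler = handler;
    }

    @Override
    public void handleResult(ResultContext<? extends T> resultContext) {
        handler.accept(resultContext.getResultObject());
    }
}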
Here's a graph of the JVM heap space (using JVisualVM) when I run this locally. I've run this consistently with Java 8 (jdk1.8.0_51 and 1.8.0_112) locally and have gotten great results; I've even written out a terabyte of data.
^ In the above, the max heap space is set to 4 gigs, and the most it ever increases to is 1.5 gigs, before going back down to around 500 MB, while streaming data to the CSV file as it's supposed to.
However, when I run this on Ubuntu with JDK 1.8.0_111, the exact same operation will not complete, running out of heap space (java.lang.OutOfMemoryError: Java heap space).
I've upped the -Xmx value from 8 gigs to 16 to 25 gigs and still run out of heap space. Meanwhile... the file is only 10 gigs in total... which really perplexes me.
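(A quick sanity check, in case the -Xmx value isn't actually reaching the JVM on the AWS instance: log what the runtime reports as its max heap. This is just a throwaway diagnostic, not part of the feed code.)

// Throwaway diagnostic: confirm the JVM actually received the -Xmx value we think it did.
log.info(String.format("Max heap reported by JVM: %d MB",
        Runtime.getRuntime().maxMemory() / (1024 * 1024)));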
Here's what the JVisualVm graph looks like on the Ubuntu box:
I've no doubt it's the exact same code running in both environments, with the same operation being performed in each (same database server providing the same data).
The only differences I can think of at this point are:
- Operating system - Ubuntu vs Mac OS X
- Hosted VM in AWS vs hard metal laptop
- Network speed is faster in AWS between database and Ubuntu server
- JDK version is 1.8.0_111 in Ubuntu, tried 1.8.0_51 and 1.8.0_112 locally
Can anyone help shed any light on this problem?
Update
I've tried replacing all the try-with-resources statements with explicit flush() and close() calls, with no luck.
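Roughly what that variant looked like (a sketch, assumed to live in a method that already throws IOException; same CsvUtil helper and handler as above):

// Sketch of the explicit flush/close variant; behavior is equivalent to the try-with-resources version.
FileOutputStream out = null;
CSVPrinter printer = null;
try {
    out = new FileOutputStream(file);
    printer = CsvUtil.openPrinter(out);
    warehouse.doForAllStaysByProvider(providerCode, startDate, endDate, stayPrintingHandler(printer));
    printer.flush();
} finally {
    if (printer != null) {
        printer.close();
    }
    if (out != null) {
        out.close();
    }
}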
What's more, I tried forcing a garbage collection on the Ubuntu box as soon as I started to see the data come in, and it had no effect: something is definitely stopping the heap from being collected on the Ubuntu machine... while running the exact same code on OS X let me write the full enchilada again with no problem.
Update 2
In addition to the differences in the environments above, the only other difference I can think of is whether the connection between the servers in AWS is so fast that the data streams in faster than it can be flushed to disk... but that still doesn't explain the issue where I only have 10 gigs of data total, and it blows up a JVM with 20 gigs of heap space.
Is there any likelihood of there being a bug at the Ubuntu/Java level for this?
Update 3
Tried replacing the CSVPrinter output with an entirely separate library (OpenCSV's CSVWriter in lieu of Apache's CSV library), and the same result occurs.
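Roughly what the OpenCSV variant looked like (a sketch, assuming OpenCSV 3.x; the FStay-to-String[] mapping is simplified here):

import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

import com.opencsv.CSVWriter;

// Sketch of the OpenCSV-based variant: same streaming structure, different CSV library.
public void streamData(OutputStream outputStream, String providerCode, Date startDate, Date endDate) throws IOException {
    try (CSVWriter writer = new CSVWriter(new OutputStreamWriter(outputStream, StandardCharsets.UTF_8))) {
        StreamingHandler<FStay> handler = new StreamingHandler<>();
        handler.setHandler((stay) -> {
            // convert each row to String[] and write it immediately; nothing is retained
            writer.writeNext(asList(stay).stream().map(String::valueOf).toArray(String[]::new));
        });
        warehouse.doForAllStaysByProvider(providerCode, startDate, endDate, handler);
    }
}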
As soon as this code starts receiving data from the database, the heap starts blowing up and the garbage collector fails to reclaim any memory... but only on Ubuntu. On OS X, everything is reclaimed immediately and the heap never grows.
I've also tried flushing the stream after every write, but had no luck with that either.
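That attempt amounted to adding a flush inside the handler, roughly:

handler.setHandler((stay) -> {
    try {
        EXPORTER.writeStay(printer, stay);
        printer.flush();   // push each record straight through to the underlying stream
    } catch (IOException e) {
        log.error("Issue with writing output: " + e.getMessage(), e);
    }
});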
Update 4
Got the heap dump to print out, and according to it I should be looking at the database driver: specifically the InboundDataHandler in Amazon's Redshift driver.
I'm using myBatis with a custom result handler. I tried setting the result handler to effectively do nothing when it gets a result (new ResultHandler<>() { // method overridden to do literally nothing}) and I know I'm not holding on to any references there.
Since it's the InboundDataHandler defined by AWS/Redshift... it makes me think it may be lower than the myBatis level... either:
- Error in the SqlSessionFactory I'm setting up
- Bug in the Redshift driver that only pops up in Ubuntu / AWS
- Bug in the result handler I have overwritten
Here's the heap dump screenshot:
Here's where I'm setting up my SqlSessionFactoryBean:
@Bean
public javax.sql.DataSource redshiftDataSource() throws ClassNotFoundException {
    log.info("Got to datasource config");
    // Dynamically load driver at runtime.
    Class.forName(dataWarehouseDriver);
    DataSource dataSource = new DataSource();
    dataSource.setURL(dataWarehouseUrl);
    dataSource.setUserID(dataWarehouseUsername);
    dataSource.setPassword(dataWarehousePassword);
    return dataSource;
}
@Bean
public SqlSessionFactoryBean sqlSessionFactory() throws ClassNotFoundException {
    SqlSessionFactoryBean factoryBean = new SqlSessionFactoryBean();
    factoryBean.setDataSource(redshiftDataSource());
    return factoryBean;
}
Here's the myBatis code I'm running as a test to verify that it's not me holding on to records in my ResultHandler:
warehouse.doForAllStaysByProvider(providerCode, startDate, endDate, new ResultHandler<FStay>() {
    @Override
    public void handleResult(ResultContext<? extends FStay> resultContext) {
        // do nothing
    }
});
Is there a way I can force the SQL connection not to hang on to records or something? I'll reiterate that on my local machine there is no issue with this memory leak... it only surfaces when running the code in the hosted AWS environment. And in both cases, the database driver and server are the same.
Update 6
I think it's finally fixed. Thanks to all who pointed me in the direction of the heap dump. That helped narrow it down to the offending class in a huge way.
After that, I did some research on the AWS Redshift driver, and it explicitly says that clients should specify a limit for any operations on large data sets. So I found out how to do that in my myBatis configuration:
<select id="doForAllStaysByProvider" fetchSize="1000" resultMap="FStayResultMap">
    select distinct
        f_stay.uid,
And this did the trick.
Mind you, this wasn't necessary even when handling much larger data sets downloaded remotely from AWS (database in AWS, code executing on a laptop at home), and it shouldn't be necessary at all, since I'm overriding the myBatis ResultHandler<>, which handles each row individually and never holds on to any objects.
Yet something funky happens with the AWS Redshift JDBC driver only when it's run in AWS (database in AWS, code executing on an AWS instance) that causes this InboundDataHandler to never release its resources unless a fetchSize is specified.
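For anyone configuring myBatis programmatically rather than per-statement in the XML, I believe the same thing can be applied globally as a default fetch size on the Configuration (an untested sketch, assuming MyBatis 3.3+ for setDefaultFetchSize and mybatis-spring 1.3+ for setConfiguration):

// Untested sketch: set a global default fetch size instead of per <select> statement.
@Bean
public SqlSessionFactoryBean sqlSessionFactory() throws ClassNotFoundException {
    SqlSessionFactoryBean factoryBean = new SqlSessionFactoryBean();
    factoryBean.setDataSource(redshiftDataSource());

    org.apache.ibatis.session.Configuration configuration = new org.apache.ibatis.session.Configuration();
    configuration.setDefaultFetchSize(1000);   // statements fetch in batches of 1000 unless they override this
    factoryBean.setConfiguration(configuration);

    return factoryBean;
}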
Here's the heap of the server running now, getting much further than it ever has before in AWS, with the heap space never moving above 500 MB; after I hit 'Force GC' in JVisualVM, it shows the used heap at less than 100 MB:
Thanks again in a huge way to all those who helped guide this!