I need to write ~50 million rows fetched from a jdbc ResultSet to a CSV file.
1.5 million rows written to a CSV file amounts to 1 GB approximately.
jdbcTemplate.query(new CustomPreparedStatementCreator(arg), new ResultSetExtractor<Void>() {
@Override
public Void extractData(ResultSet rs) {
while (rs.next()) {
// transform each row's data (involves creation of objects)
// write the transformed strings to csv file
}
}
The problem is I have a heap of 8 GB and it gets filled up pretty fast.
Hence I run into java.lang.OutOfMemoryError before I get to 10 million rows.
Another limitation I have is the query read/write timeout which is set to 30 minutes.
What can I do to recycle and reuse the JVM heap memory?
Especially the memory allocated for objects that I don't need anymore.
I read that forcing GC to run does not guarantee memory will be reclaimed.
What are my options? Should I defer the responsibility to non-GC languages
like C,C++ via JNA or JNI to process the ResultSet?
[EDIT] It seems I am in a tough spot :D Adding more info as pointed out by @rzwitserloot
- I am reading (SELECT queries only) data from a data-virtualization tool that is hooked to a data lake.
- The data-virtualization tool's jdbc driver does support LIMIT but the queries are designed by the business to return huge volumes of data. So I've got one-shot to pull the data and generate a CSV - meaning, I cannot avoid the giant SELECT or put a LIMIT clause
- I need to check these properties:
resultSetType
,resultSetConcurrency
,resultSetHoldability
.
What I have already done:
First, I used a Producer-Consumer pattern to separate the jdbc fetch operations from slow file write operations. This helped create CSV files containing 1-5 million rows before 30 mins timeout.
Second, I increased the number of consumer threads and have them write to their own separate part-file only to be merged later into a single CSV file. This sped up file write and create a CSV file containing 10-20 million rows before the 30 mins timeout.
I am creating objects inside the ResultSetExtractor and passing it to consumer threads via a bounded queue. These objects are not needed once the data from them is written to the file.