
I have something like:

    List<Data> dataList = steps.stream()
        .flatMap(step -> step.getPartialDataList().stream())
        .collect(Collectors.toList());

So I'm combining the partial lists from every step into dataList.

My problem is that dataList might cause an OutOfMemoryError. Any suggestions on how I can batch dataList and save the batches into the database?

My primitive idea is to:

    for (Step step : steps) {
        List<Data> partialDataList = step.getPartialDataList();

        if (dataList.size() + partialDataList.size() > MAXIMUM_SIZE) {
            saveIntoDb(dataList);         // flush the current batch first
            dataList = new ArrayList<>();
        }
        dataList.addAll(partialDataList); // keep the new data in every case
    }
    saveIntoDb(dataList);                 // flush whatever is left over

PS: I know there is this post, but the difference is that I might not be able to store the whole data in memory.

LE: the getPartialDataList method is more like createPartialDataList()


1 Answer


If your concern is OutOfMemoryError, you probably shouldn't create additional intermediate data structures such as lists or streams before saving to the database.

Since Step.getPartialDataList() already returns List<Data>, that data is already in memory, unless you have your own List implementation. You just need to use a JDBC batch insert:

    // Stream each step's data straight into a JDBC batch instead of
    // collecting an intermediate list first; the column values depend
    // on your schema.
    try (PreparedStatement ps = c.prepareStatement("INSERT INTO data VALUES (?, ?, ...)")) {
        for (Step step : steps) {
            for (Data data : step.getPartialDataList()) {
                ps.setString(1, ...);
                ps.setString(2, ...);
                ...
                ps.addBatch();
            }
        }
        ps.executeBatch();
    }

There is no need to prematurely chunk dataList into smaller batches. First check what your database and JDBC driver support before optimizing.
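If profiling does show that your driver buffers the entire batch, here is a minimal sketch of periodic flushing (the BATCH_SIZE value and the data.getId()/data.getValue() accessors are assumptions for illustration, not part of your code):

    // Sketch: execute the batch every BATCH_SIZE rows so the driver never
    // buffers more than one chunk at a time. BATCH_SIZE is a placeholder.
    final int BATCH_SIZE = 1_000;
    try (PreparedStatement ps = c.prepareStatement("INSERT INTO data VALUES (?, ?)")) {
        int count = 0;
        for (Step step : steps) {
            for (Data data : step.getPartialDataList()) {
                ps.setString(1, data.getId());    // hypothetical accessors
                ps.setString(2, data.getValue());
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch();            // send the accumulated chunk
                }
            }
        }
        ps.executeBatch();                        // flush the remainder
    }

Also note that some drivers can rewrite batches into multi-row inserts for you, e.g. the PostgreSQL driver's reWriteBatchedInserts=true connection property, so measure before chunking by hand.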

Do note that for most databases the right way to insert a large amount of data is an external bulk-loading utility rather than JDBC, e.g. PostgreSQL has COPY.
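For illustration, a minimal sketch of COPY through the PostgreSQL JDBC driver's CopyManager (the data (id, value) table layout and the CSV-building accessors are assumptions; the actual COPY input depends on your schema):

    import org.postgresql.copy.CopyManager;
    import org.postgresql.core.BaseConnection;
    import java.io.StringReader;

    // Sketch: one COPY per step keeps memory bounded to a single partial list.
    CopyManager copyManager = new CopyManager(c.unwrap(BaseConnection.class));
    for (Step step : steps) {
        StringBuilder csv = new StringBuilder();
        for (Data data : step.getPartialDataList()) {
            csv.append(data.getId()).append(',')      // hypothetical accessors
               .append(data.getValue()).append('\n');
        }
        copyManager.copyIn("COPY data (id, value) FROM STDIN WITH (FORMAT csv)",
                new StringReader(csv.toString()));
    }

COPY skips the per-statement overhead of INSERT, which is why it is usually the fastest way to bulk-load PostgreSQL.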

  • Best case would have been to do a bulk insert with all the data. But in a prod env there might be a small chance to get OutOfMemory. On step.getPartialDataList() I will create data on the fly, maybe the naming here is not the best. But surely the entire data is not in memory – UnguruBulan Nov 19 '19 at 21:44
  • @UnguruBulan I'd argue this answer is still the right one for your use-case. You can't process less data at a time than the result of one `getPartialDataList()` – MyStackRunnethOver Nov 20 '19 at 00:07
  • I want to process more data than the result of `getPartialDataList()` – UnguruBulan Nov 20 '19 at 07:30