
Let's say I have a stream of objects loaded from a database (using Spring Data JPA, as follows):

public interface MyJpaRepository extends JpaRepository<Foo, String> {

  Stream<Foo> findAll();
}

And let's say there are millions of Foo objects stored in my database, taking up far more GB than my max heap size.

I'm expecting that consuming the stream as follows would let the JVM manage its heap properly, garbage-collecting processed objects as more are loaded from the database:

try (Stream<Foo> fooStream = myJpaRepository.findAll()) {
  fooStream.forEach(entity -> logger.info("Hello !"));
}

But in fact, this exact code throws an OutOfMemoryError.

  • How does the garbage collector act in this case?
  • Why does consuming this stream with a forEach require the JVM to load the entire data set into memory (as per my understanding)?

Thank you

Jeep87c
    This is not about the garbage collector. It's about you loading that much data into memory and keeping it ineligible for garbage collection. Your `findAll()` method should return a stream that fetches data from the db on demand (as is done when reading a result set). – ernest_k Feb 18 '21 at 04:50
  • @ernest_k you're absolutely right! Just learned that Postgres (the underlying DB in my case) always returns the entire ResultSet unless configured otherwise. The issue is not with my stream-consuming code but with my `findAll()` method, as you pointed out. Even if I feel dumb asking this, how do I configure it to fetch data on demand? Am I on the right track if I'm looking into adding a `QueryHint` for `HINT_FETCH_SIZE`? Could you answer my question with a working example? – Jeep87c Feb 18 '21 at 05:07
  • The problem is with the implementation of `findAll()`. I suspect it's given by *spring-data-jpa*. That implementation should be the one to give a stream that is not sourced from data already loaded into memory. A naive implementation would for example load the data into a list and call `.stream()` on that. Just looked at [Stream rows from PostgreSQL (with fetch size)](https://stackoverflow.com/questions/55952655/stream-rows-from-postgresql-with-fetch-size) and found that it may be the problem. I'm not familiar with Spring, but maybe there's an answer. – ernest_k Feb 18 '21 at 05:14
  • This may be a case where the library fails you and you need to get some low-level artifact from it and you take it from there yourself in code (as in, ask it to give you the result set and you build your stream from it using iteration techniques). But this advice should come from someone who knows spring-data-jpa well. – ernest_k Feb 18 '21 at 05:18

3 Answers


A Java Stream won't fetch all the data from the underlying database up front. Streams do not store data; rather, they provide data from a source such as a collection, an array, or an IO channel, and they are generally lazily evaluated. So when `logger.info` gets called on each entity, the stream fetches the next element from the underlying data store and applies the operation. Since the stream just provides an iterator, it only needs the next element in the iteration, not the whole set, and the GC can reclaim each fetched object once the lambda function has been applied to it.
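This laziness can be sketched with a plain Java stream (the `Row` record below is a hypothetical stand-in for a database entity): an infinite, on-demand source consumed with `forEach` hands over one element at a time, so millions of elements pass through without ever accumulating on the heap.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

public class LazyStreamDemo {
    // Hypothetical stand-in for an entity loaded from a database row.
    record Row(long id) {}

    public static void main(String[] args) {
        AtomicLong processed = new AtomicLong();

        // An infinite, lazily-generated source: each element is created on demand,
        // handed to the consumer, and becomes garbage right after its iteration.
        try (Stream<Row> rows = Stream.iterate(new Row(0), r -> new Row(r.id() + 1))) {
            rows.limit(5_000_000)                      // far more objects than we ever hold at once
                .forEach(r -> processed.incrementAndGet());
        }

        System.out.println(processed.get()); // 5000000
    }
}
```

Only one `Row` is strongly reachable at any moment, which is why this runs in constant memory regardless of the element count.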

Avishek Bhattacharya

In your scenario, the garbage collector never gets a chance to reclaim memory. Let me try to explain in more detail. When you start your Java process, you can configure the heap size as well as the garbage-collection algorithm; if you don't tune either, the JVM proceeds with its defaults. Once your process starts allocating on the heap, the JVM internally collects statistics and schedules garbage collection. But if your process fills the heap with objects that are all still reachable, the collector has nothing it is allowed to reclaim, and the JVM throws an OutOfMemoryError and crashes, as you observed.
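The reachability point can be made concrete with weak references (a minimal sketch; `Foo` here is a hypothetical stand-in for a loaded entity): an object held in a list survives any number of GC cycles, while one that is processed and dropped becomes collectable.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

public class ReachabilityDemo {
    // Hypothetical stand-in for a loaded entity.
    static class Foo { final byte[] payload = new byte[1024]; }

    public static void main(String[] args) throws InterruptedException {
        // Entities accumulated in a list stay strongly reachable: the GC may run,
        // but it can never reclaim them -- this is what eventually causes the OOM.
        List<Foo> retained = new ArrayList<>();
        retained.add(new Foo());
        WeakReference<Foo> stillHeld = new WeakReference<>(retained.get(0));

        // An entity that is processed and then dropped is eligible for collection.
        WeakReference<Foo> dropped = new WeakReference<>(new Foo());

        // Request garbage collection until the dropped entity has been reclaimed.
        for (int i = 0; i < 100 && dropped.get() != null; i++) {
            System.gc();
            Thread.sleep(10);
        }

        System.out.println("retained collected: " + (stillHeld.get() == null)); // false
        System.out.println("dropped collected:  " + (dropped.get() == null));
        // Use the list afterwards so it stays reachable throughout the GC loop above.
        System.out.println("still holding " + retained.size() + " entity");
    }
}
```

Note that `System.gc()` is only a hint per the spec, so the "dropped" case is usually but not formally guaranteed to be collected; the "retained" case never is.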

Steephen

@ernest_k was 100% right in his comment; this issue has nothing to do with Streams. As @avishek-bhattacharya explained:

Streams do not store data; rather, they provide data from a source such as a collection, array, or IO channel. Generally, these are lazily evaluated.

In fact, Postgres (the underlying DB in my case) always returns the entire ResultSet unless configured otherwise (the same goes for MySQL). To make it use a database cursor instead, configure the query as follows (note that the PostgreSQL JDBC driver only streams through a cursor when autocommit is off, i.e. inside a transaction):

import static org.hibernate.jpa.QueryHints.HINT_CACHEABLE;
import static org.hibernate.jpa.QueryHints.HINT_FETCH_SIZE;
import static org.hibernate.jpa.QueryHints.HINT_READONLY;

import java.util.stream.Stream;
import javax.persistence.QueryHint;
import org.springframework.data.jpa.repository.QueryHints;

public interface MyJpaRepository extends JpaRepository<Foo, String> {

  @QueryHints(value = {
      @QueryHint(name = HINT_FETCH_SIZE, value = "1000"),  // fetch rows in batches of 1000 via a cursor
      @QueryHint(name = HINT_CACHEABLE, value = "false"),  // skip the second-level cache
      @QueryHint(name = HINT_READONLY, value = "true")     // no dirty checking on returned entities
  })
  Stream<Foo> findAll();
}
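With the hints in place, the stream should still be consumed in try-with-resources so the underlying ResultSet/cursor is released. A minimal sketch of that close behavior, using `Stream.onClose` to stand in for the resource cleanup Spring Data registers on the streams it returns:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

public class StreamCloseDemo {
    public static void main(String[] args) {
        AtomicBoolean resourceReleased = new AtomicBoolean(false);

        // Stand-in for myJpaRepository.findAll(): a close handler is registered
        // that releases the database resources when the stream is closed.
        Stream<String> fooStream = Stream.of("a", "b", "c")
            .onClose(() -> resourceReleased.set(true));

        try (Stream<String> s = fooStream) {
            s.forEach(foo -> { /* process each entity */ });
        } // try-with-resources calls close(), which runs the onClose handler

        System.out.println("released: " + resourceReleased.get()); // released: true
    }
}
```

Forgetting the try-with-resources would leave `resourceReleased` false here; with a real repository it would leak the cursor and its connection.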
Jeep87c