2

Spring Batch is designed to read and process one item at a time, then write the list of all items processed in a chunk. I want my item to be a List<T> as well, to be thus read and processed, and then write a List<List<T>>. My data source is a standard Spring JpaRepository<T, ID>.

My question is whether there are some standard solutions for this "aggregated" approach. I see that there are some, but they don't read from a JpaRepository.

Update:

I'm looking for a solution that would work for a rapidly changing dataset and in a multithreading environment.

Daniel Pop
  • Write a method that returns what you want, invoke and use it. You already have the information (you linked to it in your own question). – M. Deinum Sep 19 '22 at 13:38
  • The other approaches allow for marking the amount of data processed (with BEGIN/END-like tags), which is not the case with Jpa, I'm afraid @M.Deinum – Daniel Pop Sep 19 '22 at 14:00
  • Why wouldn't JPA allow for that? Your last list will have fewer entries than your chunk size, just as usual. That doesn't change. – M. Deinum Sep 19 '22 at 16:41
  • One detail that I forgot to mention is that the dataset is rapidly changing. In my specific case, I'm deleting each row I'm reading, so I should go for the first page each time. But that wouldn't work in a multithreaded Spring Batch job. – Daniel Pop Sep 20 '22 at 06:03
  • 1
    You nowhere mentioned the multi-threaded part. You could make that work with a synced listener and make it smart so that it knows what to read (or use a `Stream` to read x items, stuff it in a list and return it). – M. Deinum Sep 20 '22 at 08:57
  • Could you point me to some references where I can learn more about such an approach? @M.Deinum – Daniel Pop Sep 20 '22 at 09:54
  • 1
    You can create an `ItemReader` which reads x items from a `Stream` (and write a `JpaRepository` method that returns a `Stream`, not a page or list). Wrap that in a `SyncItemReader` so that only 1 process can read at a time. With that it should work. Where to look: generally the Spring Batch documentation, and Spring Data JPA on how to write a method returning a `Stream`. – M. Deinum Sep 20 '22 at 11:02

2 Answers

1

I want my item to be a List<T> as well, to be thus read and processed, and then write a List<List<T>>.

Spring Batch is not (and should not be) aware of what an "item" is. It is up to you to design what an "item" is and how it is implemented (a single value, a list, a stream, etc). In your case, you can encapsulate the List<T> in a type that could be used as an item, and process data as needed. You would need a custom item reader, though.
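A minimal sketch of that idea follows. To keep it self-contained, the delegate is a plain `Iterator<T>` rather than a real Spring Batch `ItemReader<T>`; the class and method names here are assumptions, not part of the Spring Batch API:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch: an "item" that is itself a list of T, filled from a delegate source.
// In a real job the delegate would be a Spring Batch ItemReader<T> and this
// class would implement ItemReader<List<T>>.
class ListItemReader<T> {
    private final Iterator<T> delegate;
    private final int groupSize;

    ListItemReader(Iterator<T> delegate, int groupSize) {
        this.delegate = delegate;
        this.groupSize = groupSize;
    }

    // Returns the next group of up to groupSize items, or null when the
    // delegate is exhausted (null is Spring Batch's end-of-input signal).
    public List<T> read() {
        List<T> group = new ArrayList<>();
        while (group.size() < groupSize && delegate.hasNext()) {
            group.add(delegate.next());
        }
        return group.isEmpty() ? null : group;
    }
}
```

With a chunk size of n, the writer then receives a `List<List<T>>` of up to n groups, as asked for in the question.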

Mahmoud Ben Hassine
0

The solution we found is to use a custom aggregate reader, as suggested here, which accumulates the read data into a list of a given size and then passes it along. For our specific use case, we read data using a JpaPagingItemReader. The relevant part is:

    public List<T> read() throws Exception {
        ResultHolder holder = new ResultHolder();

        // read until no more results available or aggregated size is reached
        while (!itemReaderExhausted && holder.getResults().size() < aggregationSize) {
            process(itemReader.read(), holder);
        }

        if (CollectionUtils.isEmpty(holder.getResults())) {
            return null;
        }

        return holder.getResults();
    }

    private void process(T readValue, ResultHolder resultHolder) {
        if (readValue == null) {
            itemReaderExhausted = true;
            return;
        }
        resultHolder.addResult(readValue);
    }
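For completeness, the helper state the snippet relies on could look roughly like this. The original `ResultHolder` is not shown in the answer, so this generic version is an assumption; in the real reader, `itemReader` would be the wrapped `JpaPagingItemReader<T>` and `aggregationSize` the configured list size:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the accumulator used by the aggregate read() above: it simply
// collects the items read so far into one "aggregate item".
class ResultHolder<T> {
    private final List<T> results = new ArrayList<>();

    List<T> getResults() {
        return results;
    }

    void addResult(T value) {
        results.add(value);
    }
}
```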

In order to account for the volatility of the dataset, we extended the JPA reader and overrode its getPage() method to always return 0, and controlled the dataset through the processor and writer so that the next batch of fresh data is always fetched from the first page. The hint was given here and in some other SO answers.

    @Override
    public int getPage() {
        return 0;
    }
Daniel Pop