I had a problem with a Spring Batch job that reads a large CSV file (a few million records) and saves the records to a database. The job uses a FlatFileItemReader to read the CSV and a JpaItemWriter to write the read and processed records to the database. The problem is that the JpaItemWriter doesn't clear the persistence context after flushing each chunk of items to the database, and the job ends with an OutOfMemoryError.

I have solved the problem by extending JpaItemWriter and overriding its write method so that it calls EntityManager.clear() after writing each chunk, but I was wondering whether Spring Batch already addresses this and the root of the problem is in the job config. What is the right way to address this?

My solution:

import java.util.List;

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;

import org.springframework.batch.item.database.JpaItemWriter;
import org.springframework.dao.DataAccessResourceFailureException;
import org.springframework.orm.jpa.EntityManagerFactoryUtils;

class ClearingJpaItemWriter<T> extends JpaItemWriter<T> {

    private EntityManagerFactory entityManagerFactory;

    @Override
    public void write(List<? extends T> items) {
        super.write(items);
        // super.write() has already flushed the chunk; clear the persistence
        // context so the flushed entities can be garbage collected.
        EntityManager entityManager = EntityManagerFactoryUtils.getTransactionalEntityManager(entityManagerFactory);

        if (entityManager == null) {
            throw new DataAccessResourceFailureException("Unable to obtain a transactional EntityManager");
        }

        entityManager.clear();
    }

    @Override
    public void setEntityManagerFactory(EntityManagerFactory entityManagerFactory) {
        super.setEntityManagerFactory(entityManagerFactory);
        // Keep a local reference; the corresponding field in JpaItemWriter is private.
        this.entityManagerFactory = entityManagerFactory;
    }
}

Note the added entityManager.clear() call in the write method.

Job config:

@Bean
public JpaItemWriter<Appointment> postgresWriter() {
    JpaItemWriter<Appointment> writer = new ClearingJpaItemWriter<>();
    writer.setEntityManagerFactory(pgEntityManagerFactory);
    return writer;
}

@Bean
public Step appointmentInitStep(JpaItemWriter<Appointment> writer, FlatFileItemReader<Appointment> reader) {
    return stepBuilderFactory.get("initEclinicAppointments")
            .transactionManager(platformTransactionManager)
            .<Appointment, Appointment>chunk(5000)
            .reader(reader)
            .writer(writer)
            .faultTolerant()
            .skipLimit(1000)
            .skip(FlatFileParseException.class)
            .build();
}

@Bean
public Job appointmentInitJob(@Qualifier("appointmentInitStep") Step step) {
    return jobBuilderFactory.get(JOB_NAME)
            .incrementer(new RunIdIncrementer())
            .preventRestart()
            .start(step)
            .build();
}

super.t
  • If you are sure about the EM issue, maybe an approach using `ChunkListener#afterChunk` or `ItemWriteListener#afterWrite` is less intrusive than your solution. Checking the JPA writer code, an `EntityManager.flush` is performed after every write, so the issue should not happen. Did you try with a different (small) chunk-size/skip-limit? – Luca Basso Ricci Feb 18 '19 at 16:06
  • @LucaBassoRicci I might be wrong, but flush doesn't clear the context. The listeners indeed look better than my solution (a sketch follows these comments); I just didn't know the API well. The skip limit of 1000 is an appropriate percentage of "bad" records in the CSV before the job fails, and the chunk size of 5000 is half the original 10k chunk. The answer at https://stackoverflow.com/questions/13886608/when-to-use-entitymanager-clear says that EM.clear must be called when doing batch processing, so maybe the listeners are the place to call EM.clear when dealing with large files – super.t Feb 19 '19 at 09:25
  • I created https://jira.spring.io/browse/BATCH-2797 for this. Thanks for reporting it. – Mahmoud Ben Hassine Feb 26 '19 at 11:32
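
For reference, a minimal sketch of the listener-based approach suggested in the comments above, assuming Spring Batch 4.x and the same pgEntityManagerFactory as in the job config (the class name is illustrative, not part of any API):

import java.util.List;

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;

import org.springframework.batch.core.ItemWriteListener;
import org.springframework.orm.jpa.EntityManagerFactoryUtils;

class ClearingItemWriteListener<T> implements ItemWriteListener<T> {

    private final EntityManagerFactory entityManagerFactory;

    ClearingItemWriteListener(EntityManagerFactory entityManagerFactory) {
        this.entityManagerFactory = entityManagerFactory;
    }

    @Override
    public void beforeWrite(List<? extends T> items) {
        // nothing to do before the write
    }

    @Override
    public void afterWrite(List<? extends T> items) {
        // JpaItemWriter has already flushed the chunk at this point, so
        // clearing only detaches the persisted entities to free memory.
        EntityManager entityManager = EntityManagerFactoryUtils.getTransactionalEntityManager(entityManagerFactory);
        if (entityManager != null) {
            entityManager.clear();
        }
    }

    @Override
    public void onWriteError(Exception exception, List<? extends T> items) {
        // nothing to do on a write error
    }
}

Registering it on the step, e.g. .writer(writer).listener(new ClearingItemWriteListener<Appointment>(pgEntityManagerFactory)), keeps the writer itself stock. ItemWriteListener#afterWrite runs inside the chunk transaction right after the writer flushes, whereas ChunkListener#afterChunk runs after the chunk transaction completes, so the write listener is the more natural place to clear the transactional EntityManager.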

1 Answer

That's a valid point. The JpaItemWriter (and HibernateItemWriter) used to clear the persistence context, but this was removed in BATCH-1635 (here is the commit that removed it). However, the behavior was re-added and made configurable in the HibernateItemWriter in BATCH-1759 through the clearSession parameter (see this commit), but not in the JpaItemWriter.

So I suggest opening an issue against Spring Batch to add the same option to the JpaItemWriter as well, in order to clear the persistence context after writing items (this would be consistent with the HibernateItemWriter).
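
For comparison, here is how the equivalent option is used on the Hibernate side; a minimal sketch assuming a Hibernate SessionFactory bean is available (the bean method name hibernateWriter is illustrative):

@Bean
public HibernateItemWriter<Appointment> hibernateWriter(SessionFactory sessionFactory) {
    HibernateItemWriter<Appointment> writer = new HibernateItemWriter<>();
    writer.setSessionFactory(sessionFactory);
    // Clear the Hibernate session after each chunk is written and flushed
    // (the clearSession flag added in BATCH-1759).
    writer.setClearSession(true);
    return writer;
}

A corresponding flag on the JpaItemWriter is what the suggested issue would ask for.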

That said, to answer your question: you can indeed use a custom writer to clear the persistence context, as you did.

Hope this helps.

Mahmoud Ben Hassine