
I'm dealing with a serious memory issue in a Spring Batch job.
The flow is simple: read from Oracle, convert to another object type, and write it back to another table. It concerns 200k records.
I tried both HibernateCursorItemReader and RepositoryItemReader.

My job executes the following step:

@Bean
public Step optInMutationHistoryStep() {
    return stepBuilderFactory
            .get(STEP_NAME)
            .<BodiCmMemberEntity, AbstractMutationHistoryEntity> chunk(5)
            .reader(optInItemReader)
            .processor(optInMutationHistoryItemProcessor)
            .writer(mutationHistoryItemWriter)
            .faultTolerant()
            .skipPolicy(itemSkipPolicy)
            .skip(Exception.class)
            .listener((StepExecutionListener) optInCountListener)
            .build();
}

Reader:

@Component
public class OptInItemReaderTest extends RepositoryItemReader<BodiCmMemberEntity> {

    public OptInItemReaderTest(BodiCmMemberRepository bodiCmMemberRepository){
        setRepository(bodiCmMemberRepository);
        setMethodName("findAllOptIns");
        setPageSize(100);
        Map<String, Sort.Direction> sort = new HashMap<>();
        sort.put("member_number", Sort.Direction.ASC);
        setSort(new HashMap<>(sort));
    }

}

Processor:

@Component
@StepScope
public class OptInMutationHistoryItemProcessor implements ItemProcessor<CmMemberEntity, AbstractMutationHistoryEntity> {

    Long jobId;

    @BeforeStep
    public void beforeStep(StepExecution stepExecution){
        jobId = stepExecution.getJobExecutionId();
    }

    private final MutationHistoryBatchFactory mutationHistoryFactory;

    public OptInMutationHistoryItemProcessor(MutationHistoryBatchFactory mutationHistoryFactory) {
        this.mutationHistoryFactory = mutationHistoryFactory;
    }

    @Override
    public AbstractMutationHistoryEntity process(CmMemberEntity cmMemberEntity){
        return mutationHistoryFactory.addMutationHistoryEntity(cmMemberEntity, jobId, OPT_IN);
    }
}

Itemwriter:

@Component
public class MutationHistoryItemWriter extends RepositoryItemWriter<AbstractMutationHistoryEntity>{

    public MutationHistoryItemWriter(MutationHistoryRepository mutationHistoryRepository) {
        setRepository(mutationHistoryRepository);
    }
}

The factory method that I use in the processor:

public AbstractMutationHistoryEntity addMutationHistoryEntity(CmMemberEntity cmMemberEntity, Long jobId, JobType jobType) {
    return mutationHistoryEntityBuilder(cmMemberEntity, jobId, jobType, MutationType.ADD)
            .editionCode(cmMemberEntity.getPoleEditionCodeEntity().getEditionCode())
            .firstName(cmMemberEntity.getFirstName())
            .lastName(cmMemberEntity.getLastName())
            .streetName(cmMemberEntity.getStreetName())
            .houseNumber(cmMemberEntity.getHouseNumber())
            .box(cmMemberEntity.getBox())
            .postalCode(cmMemberEntity.getPostalCode())
            .numberPieces(DEFAULT_NUMBER_PIECES)
            .build();
}

I don't see any immediate references being held in memory, so I'm not sure what causes the rapid memory increase. It doubles quickly.

Intuitively I feel the problem is either that the retrieved result set isn't flushed regularly, or that the processor leaks somehow, but I'm not sure why it would, since I don't hold references to the objects I create within the processor.

Any suggestion?

Edit:

My full job looks as follows:

@Bean
public Job optInJob(
        OptInJobCompletionNotificationListener listener,
        @Qualifier("optInSumoStep") Step optInSumoStep,
        @Qualifier("optInMutationHistoryStep") Step optInMutationHistoryStep,
        @Qualifier("optInMirrorStep") Step optInMirrorStep) {
    return jobBuilderFactory.get(OPT_IN_JOB)
            .incrementer(new RunIdIncrementer())
            .listener(listener)
            .flow(optInSumoStep)
            .next(optInMutationHistoryStep)
            .next(optInMirrorStep)
            .end()
            .build();
}

In the first step, the same item reader is used to write records to an XML file. In the second step, the step shared above is executed.
In OpenShift it's pretty clear that nothing gets cleaned up, and as far as I know I don't hold any references; I wouldn't know why I would:

[graph: memory consumption rising in OpenShift]

It is pretty clear that the consumption does flatten after a while, but it keeps rising; it never goes down. After the first step (around 680Mb), I would have expected it to drop. Even more, I would expect the curve to go flat during the first step as well; after increasing the chunk size to 100, memory should get released after every chunk of 100 processed items.

The HibernateCursorItemReader is pretty disastrous; it had already risen to 700Mb during the first step. The RepositoryItemReader seems to perform better. Maybe there's a reason for it, but I'm not sure which one it is.

I can't say right now whether anything gets cleaned up after the job; as it processes 200k records, it takes some time, and I assume it will run out of memory again before it finishes.

I'm concerned we won't be ready for production if we don't manage to solve this problem.

Edit 2:

Interestingly, the curve has flattened to the point where memory consumption no longer increases in step 2. I couldn't currently tell why that is the case 'now'.

[graph: memory consumption flattening during step 2]

The current hope/expectation is that step 3 doesn't increase memory consumption, and that the memory gets cleaned up after the job has finished. It does appear that the insert speed has slowed down drastically; I estimate by a factor of 3 (about 110k records into the second step).

Edit 3: Applying flush on the item writer did nothing to reduce memory consumption or improve speed:

[graphs: memory consumption and processing speed after adding flush]

Actually, it's running straight towards my limits this time, and I can't explain it:

[graph: memory consumption heading towards the limit]

You can clearly see how the process slows down tremendously, for just reading, transforming, and writing records.

[graph: processing speed degrading over time]

I currently have no idea why, but for a batch application I don't think this is acceptable behavior for us to move to production. After two batch runs it would just go stale.

html_programmer
  • So what's the memory issue? Does it not get released when the batch finishes processing? Does it increase with every batch? – Roddy of the Frozen Peas Oct 03 '21 at 15:10
  • When I deploy this on openshift, I see that the amount of memory goes straight up during the processing (and keeps on going up without releasing anything). This step is part of a bigger job, but it goes out of memory before the job finishes. (OOM gets thrown) – html_programmer Oct 03 '21 at 15:11
  • Did you try allocating more memory? Maybe it just needs more in your production environment. The fact that you OOM before the job finishes means you never get a good picture of what sort of resources the job needs as a whole. – Roddy of the Frozen Peas Oct 03 '21 at 17:16
  • @RoddyoftheFrozenPeas I don't want to allocate more memory. According to the documentation I've read, the GC will be called depending on the chunk size defined. If the memory keeps increasing, it doesn't matter how much memory I allocate, we will run out of heap memory. It went from 280Mb -> 1Gb. – html_programmer Oct 03 '21 at 18:42
  • Which `CrudRepository` implementation are you using in your `RepositoryItemWriter`? If it is a JPA repository, then you might be facing an issue similar to https://stackoverflow.com/questions/54746793, in which case you need to call `org.springframework.data.jpa.repository.JpaRepository#flush` in your custom writer. – Mahmoud Ben Hassine Oct 04 '21 at 06:35
  • @MahmoudBenHassine It is the RepositoryItemWriter in package org.springframework.batch.item.data. It does use the CrudRepository, but I can't seem to access it directly since it has private access only; otherwise I could have done it in the `doWrite` override. I see that you are the author of the linked answer. The heap memory issue is mostly solved; I still had a heap issue, but it's just the JVM -Xmx setting I think. But it still gets very slow over time (it took up to 4 hours, I noticed); would you have a suggestion on how to manage the flush? – html_programmer Oct 04 '21 at 08:01
  • Currently just adding the repository locally to the item writer itself / overriding the write method with repository.flush at the end; I don't suppose it makes a difference. I'll deploy / run it and share the result. – html_programmer Oct 04 '21 at 08:34
  • Not just call `flush`, also `clear`. You need the `EntityManager` for that. What happens is that objects get added to the 1st level cache and dirty checking is done. With a `flush()` you sync state to the database **but** the objects are still in the 1st level cache. With 100 objects this isn't an issue; when you have 100000 objects, that is an issue. – M. Deinum Oct 04 '21 at 10:06
  • @M.Deinum You're right, thanks for the tip, I'll try this out in the afternoon. – html_programmer Oct 04 '21 at 10:19
  • 1
    Yes, it should be `flush` and `clear`. I seem to have forgotten to mention `clear` in my previous comment but this is mentioned in the similar question/answer I shared. – Mahmoud Ben Hassine Oct 04 '21 at 11:59

1 Answer

The issue got resolved (or at least to an acceptable level) thanks to @Mahmoud Ben Hassine and @M. Deinum.
The item writers look as follows:

@Component
public class MutationHistoryItemWriter extends RepositoryItemWriter<AbstractMutationHistoryEntity>{

    @PersistenceContext
    private EntityManager entityManager;

    private final MutationHistoryRepository mutationHistoryRepository;

    public MutationHistoryItemWriter(MutationHistoryRepository mutationHistoryRepository) {
        this.mutationHistoryRepository = mutationHistoryRepository;
        setRepository(mutationHistoryRepository);
    }

    @Override
    public void write(List<? extends AbstractMutationHistoryEntity> items) throws Exception {
        super.write(items);
        mutationHistoryRepository.flush();
        entityManager.clear();
    }

}

public class MirrorItemWriter extends RepositoryItemWriter<SumoCmMemberEntity> {

    @PersistenceContext
    private EntityManager entityManager;

    private final SumoCmMemberRepository sumoCmMemberRepository;

    public MirrorItemWriter(SumoCmMemberRepository sumoCmMemberRepository) {
        this.sumoCmMemberRepository = sumoCmMemberRepository;
        setRepository(sumoCmMemberRepository);
    }

    @Override
    public void write(List<? extends SumoCmMemberEntity> items) throws Exception {
        super.write(items);
        sumoCmMemberRepository.flush();
        entityManager.clear();
    }

}

Interestingly:

  • Writing the XML (step 1) also suddenly sped up drastically after flushing the writers and clearing the entity managers, which doesn't make sense to me, but hey, it works (I don't use async processing or multithreading).
  • The HibernateCursorItemReader performs more normally now as well. I'm not sure what triggered the massive memory consumption during my initial tests (the difference between HibernateCursorItemReader and RepositoryItemReader), but it doesn't matter at this point.
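A plausible explanation for both the memory growth and the XML speedup (an assumption on my part, not something verified against this job): Hibernate dirty-checks every entity still in the persistence context on each flush, so without `clear()` the per-chunk cost grows with everything read so far. A toy cost model, not real Hibernate code, sketches the difference:

```java
// Toy cost model: each flush dirty-checks every entity managed by the
// persistence context. Without clear(), the context grows by one chunk per
// flush, so the total work is quadratic in the number of records.
public class DirtyCheckCost {

    static long flushWork(int records, int chunkSize, boolean clearAfterChunk) {
        long work = 0;
        long managed = 0; // entities in the first-level cache
        for (int done = 0; done < records; done += chunkSize) {
            managed += chunkSize; // the chunk's entities become managed
            work += managed;      // flush dirty-checks all managed entities
            if (clearAfterChunk) {
                managed = 0;      // clear() empties the first-level cache
            }
        }
        return work;
    }

    public static void main(String[] args) {
        // 200k records, chunk size 100:
        System.out.println(flushWork(200_000, 100, false)); // 200100000 units of work
        System.out.println(flushWork(200_000, 100, true));  // 200000 units of work
    }
}
```

Under this model, flushing without clearing does roughly a thousand times more dirty-checking work over 200k records, which would line up with the drop from hours to minutes.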

I executed the test twice with a break in between, processing 3 steps * 200k records (600k total) each time.

It basically went from 4 hours plus a heap overflow to a mere few minutes on the first run:

[graph: first run completing in minutes]

On the second run, I see some degradation happening again, but it's still what I'd consider fast compared to the situation before:

[graph: second run timings]

For some reason, the memory just doesn't want to give it a break though. I'm starting to take it personally (lol).
The processor spikes briefly, but nothing out of order.

[graph: brief processor spikes]

It is not really a problem for me, since the 200k records are just a load test. The initial state will be set through SQL scripts, so it's not really an issue; subsequent mutations will concern a few thousand per week at most.
We will also be putting Influx / Grafana monitoring in place, so we will be able to monitor the JVM and have alerting.
It still bothers me a bit though, so I will continue to do some tests for the rest of the day.

Edit: After replacing all HibernateCursorItemReaders with RepositoryItemReaders, the memory consumption looks good.
The following graph represents two batch runs of 600k records total each:

[graph: memory consumption over two batch runs]

I'm dealing with a strange issue where only half of my records were stored in the db, but that may be because of an unrelated bug.

Edit: Only half of the records were stored because the RepositoryItemReader is paginated. Meanwhile, I fixed the memory bug with the HibernateCursorItemReader.

@Component
public class OptInItemReader extends HibernateCursorItemReader<BodiCmMemberEntity> {

    SessionFactory sessionFactory;

    private static final String ITEM_READER_NAME = "OPT_IN_ITEM_READER";
    private static final String QUERY_FIND_OPT_INS =

            """
            Your query 
            """;

    public OptInItemReader(SessionFactory sessionFactory) throws Exception {
        this.sessionFactory = sessionFactory;
        setName(ITEM_READER_NAME);
        setSessionFactory(sessionFactory);
        setQueryProvider(provider(QUERY_FIND_OPT_INS, sessionFactory));
    }

    private HibernateNativeQueryProvider<BodiCmMemberEntity> provider(String query, SessionFactory sessionFactory) {
        HibernateNativeQueryProvider<BodiCmMemberEntity> provider = new HibernateNativeQueryProvider<>();
        provider.setSqlQuery(query);
        provider.setEntityClass(BodiCmMemberEntity.class);
        return provider;
    }

}

I had assigned a session in the provider on which I began a transaction manually, which wasn't needed. Now the flush is properly managed internally.
