Spring Batch taking long time to complete the Job

Question

I am working on a Spring batch application. This application read data from DB, process and send to the Kafka. I need to read data from two table in parent-child relationship.

Like :

Parent : - Id, Name

Child : - Id, Name, Parent_Id

I am using JpaPagingItemReader. I am reading Parent table data from reader and Child data from process.

    @Autowired
    private JpaTransactionManager transactionManager;

    @PersistenceContext
    private EntityManager em;

    public ItemStreamReader<Parent> reader() {
        JpaPagingItemReader<Parent> itemReader = new JpaPagingItemReader<>();
        try {
            String sqlQuery = "SELECT * FROM PARENT";
            JpaNativeQueryProvider<Parent> queryProvider = new JpaNativeQueryProvider<Parent>();
            queryProvider.setSqlQuery(sqlQuery);
            queryProvider.setEntityClass(Parent.class);
            queryProvider.afterPropertiesSet();

            itemReader.setEntityManagerFactory(em.getEntityManagerFactory());

            itemReader.setPageSize(100);
            itemReader.setQueryProvider(queryProvider);
            itemReader.afterPropertiesSet();
            itemReader.setSaveState(true);
        }
        catch (Exception e) {
            System.out.println("BatchConfiguration.reader() ==> error " + e.getMessage());
        }
        return itemReader;    
    }
    @Autowired
    private ChildRepository childRepository;
    @Bean
    public ItemProcessor<Parent,ParentVO> opptyProcess() {
        return new ItemProcessor<Parent, ParentVO>() {

            @Override
            public ParentVO process(Parent parent) throws JsonProcessingException {

                ParentVO parentVO = new ParentVO();

                parentVO.setId(parent.getId());
                parentVO.setName(parent.getName());

                List<Child> childList = childRepository.findByParentId(parent.getId());
                if(childList != null && childList.size() > 0) {
                    for(Child child :childList) {
                        ChildVo childVO= new ChildVO();
                            childVO.setId(child.getId);
                            childVO.setName(child.getName());
                            childVO.setParentId(child.getParentId())
                        ParentVO.getChildList().add(childVO);
                    }
                }
                return parentVO;
            }
        };
    }

    @Bean
    public Step step() {
        return stepBuilderFactory.get("step1")
                .<Parent, ParentVO>chunk(100)
                .reader(reader())
                .processor(process())
                .writer(writer)
                .taskExecutor(threadPool)
                .transactionManager(transactionManager)
                .throttleLimit(10)
                .build();
    }

I am testing this app with 20k records. The performance of this app is very slow. Every minute it can read/process/write only 100 records. If I comments the below line it takes 2 minutes to complete the job.

 List<Child> childList = childRepository.findByParentId(parent.getId());
                if(childList != null && childList.size() > 0) {
                    for(Child child :childList) {
                        ChildVo childVO= new ChildVO();
                            childVO.setId(child.getId);
                            childVO.setName(child.getName());
                            childVO.setParentId(child.getParentId())
                        ParentVO.getChildList().add(childVO);
                    }
                }

What the other way I can do to get the Child table data and make this Job faster.

score 0 · Accepted Answer · answered Apr 29 '20 at 07:30

You are basically doing a join operation on the application side, and that's the cause of your performance issue.

The pattern you are implementing is similar to the driving query pattern, where a processor is used to enrich items returned by the reader. This pattern is known to suffer from the n+1 problem, which performs poorly in some circumstances.

I recommend you do the join on the database side. Relational database systems are well optimized for this kind of operations, and if your application grabs data already joined on the db side, you will notice a big performance boost.

Thanks for your response. I can join in DB side if it was one-to-one relation. But this is one-to-many data. I don't know how I can avoid the second query for each records. So in this kind of scenario performance will be always poor with high volume data? — user12258186, Apr 29 '20 at 15:31

Spring Batch taking long time to complete the Job

1 Answers1