I am using Spring Batch to extract some data from the Wikipedia XML dump (a single file of 30-odd gigabytes). I am using the StaxEventItemReader to read in page elements and then do some analysis on each page. Once the analysis of each entry is complete, I inject the resulting data into a database. It is a very simple Spring Batch workflow:
read->process->write
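For reference, my step configuration looks roughly like the following. This is an illustrative sketch, not my exact config: the bean names, the unmarshaller, and the dump filename are placeholders.

```xml
<!-- Sketch of the current single-threaded chunk step (names are hypothetical). -->
<job id="wikipediaImport" xmlns="http://www.springframework.org/schema/batch">
    <step id="importPages">
        <tasklet>
            <chunk reader="pageReader" processor="pageProcessor"
                   writer="pageWriter" commit-interval="100"/>
        </tasklet>
    </step>
</job>

<bean id="pageReader"
      class="org.springframework.batch.item.xml.StaxEventItemReader">
    <!-- Each <page> fragment of the dump becomes one item. -->
    <property name="fragmentRootElementName" value="page"/>
    <property name="resource" value="file:enwiki-latest-pages-articles.xml"/>
    <property name="unmarshaller" ref="pageUnmarshaller"/>
</bean>
```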
I would like the processing stage to be multithreaded, as it is self-contained, processor intensive, and the write stage doesn't rely on the order.
          / process \
    read -- process -- write
          \ process /
I have read this question, where the top answer says that data between stages is stored in the JobRepository, and that it isn't advisable to store large amounts of data there.
I have also seen the 'parallel' example in the Spring Batch distribution, but it runs the whole of the second, 'loading' step (i.e. a reader, processor and writer) in parallel, rather than running just the processing in parallel.
Is it possible to specify that the process stage should run in a thread pool of a particular size? Does my workflow fit Spring Batch, or would it be better to rewrite it as a normal J2SE program?
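To make the shape of the J2SE fallback concrete, here is a minimal sketch of what I have in mind: a single-threaded reader feeding a fixed-size pool that processes and writes in whatever order tasks finish. The `read`/`process`/`write` methods are hypothetical stand-ins for my real StAX reader, analysis code, and database writer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelProcessSketch {
    // Hypothetical stand-ins for the real reader/processor/writer.
    static String read(int i) { return i < 5 ? "page-" + i : null; } // null = end of input
    static String process(String page) { return page.toUpperCase(); } // CPU-heavy in reality
    static synchronized void write(String row, List<String> db) { db.add(row); }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4); // the "particular size"
        List<String> db = new ArrayList<>();
        List<Future<?>> inFlight = new ArrayList<>();

        // Single-threaded read; parallel, order-independent process + write.
        for (int i = 0; ; i++) {
            String page = read(i);
            if (page == null) break;
            inFlight.add(pool.submit(() -> write(process(page), db)));
        }
        for (Future<?> f : inFlight) f.get(); // wait for all work to drain
        pool.shutdown();
        System.out.println(db.size());
    }
}
```

The point being: this is trivial to write by hand, so if Spring Batch can't fan out just the processor, I'm tempted to drop it. (In a real version I'd bound the submit queue so the reader can't outrun the pool on a 30 GB input.)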