
I am using Spring Batch to extract some data from the Wikipedia XML dump file (a single file of roughly 30 GB). I am using the StaxEventItemReader to read in tags and then do some analysis on each page. Once the analysis of each entry is complete, I insert the resulting data into a database. It is a very simple Spring Batch workflow:

read->process->write

I would like the processing stage to be multithreaded, as it is self-contained, processor intensive, and the write stage doesn't rely on the order.

     /process\
read<-process->write
     \process/

I have read this question, where the top answer says that data between stages is stored in the JobRepository and says that it isn't advisable to store large amounts of data there.

I have seen the 'parallel' example in the Spring Batch distribution, but it runs the whole of the second, 'loading' step (i.e. a reader, processor and writer) in parallel, rather than just the processing.

Is it possible to say that the process stage should be processed in a threadpool of a particular size? Does my workflow fit in with Spring Batch, or is it better to rewrite it as a normal J2SE program?

Rich
  • I don't know Spring Batch, but have you seen [5.3.5. Split Flows](http://static.springsource.org/spring-batch/reference/html/configureStep.html#split-flows) chapter in the documentation? – Tomasz Nurkiewicz Mar 01 '12 at 12:42
  • I had, but I find it hard to see exactly how to achieve what I want. The examples in 5.3.5 and 7.2 Parallel Steps don't show how the information gets passed between them, and don't seem to show how the read data can be split across the processing threads. – Rich Mar 01 '12 at 12:49
  • [7.1 in the documentation](http://static.springsource.org/spring-batch/reference/html/scalability.html) says that you can simply configure a `TaskExecutor` into the step which will automatically cause each chunk to be processed in a separate thread. (I've done this once as a proof-of-concept and it pretty much worked as described.) The downside is that the `StaxEventItemReader` is not thread safe. Furthermore, this isn't really answering the question of how to pass data among the readers; even if you wrote your own, there's no way of sharing data that's really effective. – Ickster Mar 05 '12 at 13:37

1 Answer


Your reader has to be thread-safe. If that is not possible, I suggest using a staging area:

  • First step: analyze your data and store the results somewhere in a convenient format.
  • Once that has finished, begin the second step: insert the data using multithreading, SQL batching, and whatever else Java offers to boost performance.

Maybe a NoSQL database could be a good candidate to store the staging data.
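To illustrate the first point (a thread-safe reader feeding a pool of a fixed size), here is a minimal sketch in plain java.util.concurrent. It is not Spring Batch code: the names ParallelProcessSketch, SynchronizedReader and run are made up for the example, the Iterator stands in for a non-thread-safe reader such as StaxEventItemReader, and the result queue stands in for the writer. In Spring Batch itself the comparable setup would be a step configured with a TaskExecutor, as the comment on the question describes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical illustration; none of these names come from Spring Batch.
class ParallelProcessSketch {

    /** Wraps a non-thread-safe source so that only one thread reads at a time. */
    static final class SynchronizedReader<T> {
        private final Iterator<T> delegate;

        SynchronizedReader(Iterator<T> delegate) {
            this.delegate = delegate;
        }

        /** Returns the next item, or null once the input is exhausted. */
        synchronized T read() {
            return delegate.hasNext() ? delegate.next() : null;
        }
    }

    /**
     * Reads items one at a time, processes them on a fixed-size pool and
     * collects the results. The order of the results is not guaranteed.
     * Assumes the source never yields null items.
     */
    static <I, O> List<O> run(Iterator<I> source, Function<I, O> processor, int threads)
            throws InterruptedException {
        SynchronizedReader<I> reader = new SynchronizedReader<>(source);
        // Stand-in for the write stage; a real writer would batch inserts.
        ConcurrentLinkedQueue<O> written = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                I item;
                // Each worker pulls from the shared reader until it runs dry.
                while ((item = reader.read()) != null) {
                    written.add(processor.apply(item));
                }
            });
        }

        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
            throw new IllegalStateException("processing did not finish in time");
        }
        return new ArrayList<>(written);
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> pages = List.of("alpha", "beta", "gamma", "delta");
        List<Integer> lengths = run(pages.iterator(), String::length, 3);
        System.out.println(lengths.size() + " pages processed"); // prints "4 pages processed"
    }
}
```

Only the read call is serialized, so this pays off when processing dominates the cost of reading, which is the situation described in the question. The sketch has no restart or error handling; an exception in the processor simply ends that worker.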

Jean-Philippe Briend