
I want to use Spring Batch to process CSV files. Each CSV file contains one record per line. For a given file, some records may be interrelated, i.e. processing of such records MUST follow the order in which they appear in the file. The regular sequential approach (a single thread for the entire file) gives me poor performance, so I want to use the partitioning feature. Due to my processing requirements, interrelated records MUST end up in the same partition (and in the order in which they appear in the file). I thought about using a hash-based partitioning algorithm with a carefully chosen hash function (so that partitions of nearly equal size are created).

Any idea if this is possible with Spring Batch?

How should the Partitioner be implemented for such a case? According to one of the Spring Batch authors/developers, the master does not send the actual data, only the information required for each slave to obtain the data it is supposed to process. In my case, I guess this information would be the hash value. Therefore, would the FlatFileItemReader of each slave need to read the entire file line by line, skipping the lines with a different hash?
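Here is a rough sketch of the kind of Partitioner I have in mind, assuming the hash bucket number is the only information each slave needs (the context key names are just placeholders I made up):

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class HashBucketPartitioner implements Partitioner {

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int bucket = 0; bucket < gridSize; bucket++) {
            ExecutionContext context = new ExecutionContext();
            // The master sends only the bucket each slave is responsible for,
            // not the data itself ("partition.bucket" / "partition.gridSize"
            // are placeholder key names).
            context.putInt("partition.bucket", bucket);
            context.putInt("partition.gridSize", gridSize);
            partitions.put("partition" + bucket, context);
        }
        return partitions;
    }
}
```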

Thanks, Mickael

manash

2 Answers


What you're describing is something normally seen in batch processing. You have a couple of options here:

  1. Split the file by sequence and partition based on the created files - In this case, you'd iterate through the file once, dividing it into separate files so that each group of records that must be processed in sequence lands in the same file. From there, you can use the MultiResourcePartitioner to process each file in parallel (see the sketch after this list).
  2. Load the file into a staging table - This is the easier method, IMHO. Load the file into a staging table. From there, you can partition the processing based on any number of factors.
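For option 1, a rough sketch of the partitioned step's wiring might look like the following, assuming the file has already been split (the file location, bean names, and the pass-through line mapper are placeholders, not a definitive configuration):

```java
import java.io.IOException;

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;

@Configuration
public class PartitionedStepConfig {

    @Bean
    public MultiResourcePartitioner partitioner() throws IOException {
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        // One partition per split file; the partitioner stores each file's URL
        // in the step execution context under the key "fileName".
        partitioner.setResources(new PathMatchingResourcePatternResolver()
                .getResources("file:/tmp/splits/records-*.csv"));
        return partitioner;
    }

    @Bean
    @StepScope
    public FlatFileItemReader<String> reader(
            @Value("#{stepExecutionContext['fileName']}") Resource file) {
        FlatFileItemReader<String> reader = new FlatFileItemReader<>();
        reader.setResource(file);
        reader.setLineMapper(new PassThroughLineMapper());
        return reader;
    }
}
```

Each slave step then gets its own step-scoped reader bound to exactly one of the split files, so the order within each group of related records is preserved.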

In either case, the result allows you to scale the process out as wide as you need to go to achieve the performance you're after.

Michael Minella
  • I would go for option 1. For very large files, option 2 should perform poorly, since it is very close to the regular batch job I'm trying to run (except for the processing part). For option 1: should I split the file in a Tasklet and then define a partitioned step? – manash Sep 03 '15 at 07:05
  • For the record, I'd expect option 2 to actually perform better than option 1. Since order doesn't matter when importing the data into a staging table, you can just use `split` to split the file up into an even number of files and parallelize its import. For option 1, that's really up to you, but you're going to need to write to multiple files at once, so there's some shuffling that will need to go on as you split the file (which is one reason I'd expect option 2 to perform better). – Michael Minella Sep 03 '15 at 14:13
  • I still need to preserve the order for records that have the same hash. I could do that with option 2 by also storing the line number along with each record. But I finally went with option 1, which looks great so far. The job starts with a tasklet that reads the file line by line, computes a hash for each line, and appends the line to its appropriate file (see the sketch below). Then I have a partitioned step that uses a MultiResourcePartitioner as you recommended. I just need to check how much time the tasklet will take to process huge files. Thanks for your help! – manash Sep 04 '15 at 05:52
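A minimal sketch of the splitting tasklet described in the last comment might look like this (the paths, bucket count, and key extraction are assumptions, not the poster's actual code):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class FileSplittingTasklet implements Tasklet {

    private static final int BUCKETS = 8; // number of partitions (assumption)

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext)
            throws IOException {
        Path input = Paths.get("/tmp/records.csv"); // placeholder path
        BufferedWriter[] writers = new BufferedWriter[BUCKETS];
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            for (int i = 0; i < BUCKETS; i++) {
                writers[i] = Files.newBufferedWriter(
                        Paths.get("/tmp/splits/records-" + i + ".csv"),
                        StandardCharsets.UTF_8,
                        StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
            }
            String line;
            while ((line = reader.readLine()) != null) {
                // Related records share a key (here assumed to be the first CSV
                // column), so they land in the same bucket and keep their
                // original relative order.
                int bucket = Math.floorMod(line.split(",")[0].hashCode(), BUCKETS);
                writers[bucket].write(line);
                writers[bucket].newLine();
            }
        } finally {
            for (BufferedWriter w : writers) {
                if (w != null) {
                    w.close();
                }
            }
        }
        return RepeatStatus.FINISHED;
    }
}
```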

FlatFileItemReader is not thread-safe, so you cannot simply use it in parallel processing.

There is more info in the docs:

Spring Batch provides some implementations of ItemWriter and ItemReader. Usually they say in the Javadocs if they are thread safe or not, or what you have to do to avoid problems in a concurrent environment. If there is no information in Javadocs, you can check the implementation to see if there is any state. If a reader is not thread safe, it may still be efficient to use it in your own synchronizing delegator. You can synchronize the call to read() and as long as the processing and writing is the most expensive part of the chunk your step may still complete much faster than in a single threaded configuration.
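For illustration, a minimal synchronizing delegator along the lines the docs describe might look like this (the class name is my own; newer Spring Batch versions also ship a similar SynchronizedItemStreamReader):

```java
import org.springframework.batch.item.ItemReader;

public class SynchronizedReader<T> implements ItemReader<T> {

    private final ItemReader<T> delegate;

    public SynchronizedReader(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized T read() throws Exception {
        // Only one thread at a time reads from the delegate; processing and
        // writing of the returned items can still happen in parallel. If the
        // delegate is an ItemStream (e.g. FlatFileItemReader), it still needs
        // to be opened/closed, e.g. by registering it as a stream on the step.
        return delegate.read();
    }
}
```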

I think your question is more or less a duplicate of this one: multithreaded item reader

Maciej Stępyra
  • Maybe you can consider using another technology to parallelize your processing: Project Reactor or RxJava – Maciej Stępyra Sep 02 '15 at 20:34
  • I don't see any issue if the item reader is declared with step scope. Since a step is created for every partition, I guess a different item reader will be used by every partition, right? – manash Sep 02 '15 at 20:36
  • That's true, but they will read the same items and process them. As far as I saw in the documentation, you have to use a MultiResourceItemReader; then you can process multiple files using multiple threads - one file per partition. [see docs](http://docs.spring.io/spring-batch/trunk/reference/html/scalability.html) and check the last example – Maciej Stępyra Sep 02 '15 at 20:58
  • Different item readers will be created and each will read the entire file. However, I thought that every item reader could compute a hash of the current line and decide whether the line should be processed or skipped (and therefore processed by another item reader); see the sketch below. The fact that every item reader reads the entire file is not really a concern since it should be pretty fast. – manash Sep 03 '15 at 07:08
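A minimal sketch of that filtering idea, assuming the partition's bucket and the grid size are injected from the step execution context populated by the Partitioner (the key extraction and hash function are placeholders):

```java
import org.springframework.batch.item.ItemReader;

public class HashFilteringReader implements ItemReader<String> {

    private final ItemReader<String> delegate; // e.g. a FlatFileItemReader<String>
    private final int myBucket;
    private final int gridSize;

    public HashFilteringReader(ItemReader<String> delegate, int myBucket, int gridSize) {
        this.delegate = delegate;
        this.myBucket = myBucket;
        this.gridSize = gridSize;
    }

    @Override
    public String read() throws Exception {
        String line;
        // Skip lines that hash into another partition's bucket; order within
        // this partition's bucket is preserved because the file is read
        // sequentially. The delegate, if it is an ItemStream, still needs to
        // be registered as a stream on the step so it is opened and closed.
        while ((line = delegate.read()) != null) {
            int bucket = Math.floorMod(keyOf(line).hashCode(), gridSize);
            if (bucket == myBucket) {
                return line;
            }
        }
        return null; // end of file
    }

    // Hypothetical: extract the field that relates records to each other
    // (here assumed to be the first CSV column).
    private String keyOf(String line) {
        return line.split(",")[0];
    }
}
```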