
Context: I am working on a Spring Batch pipeline that uploads data to a database. I have already figured out how to read flat .csv files and write their items with JdbcBatchItemWriter. The pipeline must read data from a zip archive that contains multiple .csv files of different types. I'd like archive downloading and inspecting to be the first two steps of the job. I do not have enough disk space to unpack the whole downloaded archive. Instead of unpacking, I inspect the zip file content to determine the paths of the .csv files inside the zip file system and their types. Inspecting the zip also makes it easy to obtain an InputStream for each corresponding csv file. After that, reading and uploading (directly from the zip) all discovered .csv files will be executed in separate steps of the job.
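For reference, the inspect step described above can be sketched with the standard `java.util.zip.ZipFile` API, which lists entries and opens a per-entry `InputStream` without extracting anything to disk (class and method names here are illustrative, not from the question):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Inspect a zip archive and collect the paths of its .csv entries
// without unpacking the archive to disk.
public class ZipInspector {

    public static List<String> listCsvEntries(String zipPath) throws IOException {
        List<String> csvPaths = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.isDirectory() && entry.getName().endsWith(".csv")) {
                    csvPaths.add(entry.getName());
                }
            }
        }
        Collections.sort(csvPaths);
        return csvPaths;
    }

    // Later, a reader step can stream one entry directly from the zip:
    public static InputStream openEntry(ZipFile zip, String entryName) throws IOException {
        return zip.getInputStream(zip.getEntry(entryName));
    }
}
```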

Question: Is there any way to dynamically populate new Steps for each discovered csv entry of the zip at job runtime, as a result of the inspect step?

Tried: I know that Spring Batch supports conditional flows, but this technique seems to allow only a static number of steps that are well defined before job execution. In my case, the number of steps (csv files) and the reader types are discovered only at the second step of the job.

There is also a MultiResourceItemReader, which allows reading multiple resources sequentially. But I am going to read different types of csv files with appropriate readers. Moreover, I'd like "filewise" step encapsulation, so that if the loading step for one file fails, the others are still executed.

The similar question How to create dynamic steps in Spring Batch does not have a suitable solution for me, as its answer assumes step creation before the job runs, while I need to add steps as a result of the second step of the job.

peremeykin
  • Have your batch create a file which details which steps will have to be executed. At the end of the batch, execute a second batch which will execute those steps – Stultuske Jul 08 '21 at 06:56
  • The dynamic step creation is not the real issue here, you can create job-scoped step bean definition at runtime and refresh the Spring application context. Those steps will be created lazily after the inspection step. I can provide an example. However, the real issue here is that you should really know how to handle each type of file upfront right? Even if the content of the zip is not known upfront, you should know upfront how to handle each type of csv file that might be present in the zip. Do you agree on that? – Mahmoud Ben Hassine Jul 08 '21 at 10:03
  • @MahmoudBenHassine you are right, there are some difficulties related to the different types of file content, but I expect to cope with them using a `BufferedInputStream bis`. I am going to set the `bis` buffer size large enough to consume the header and the first data line entirely to determine the file type. Before reading `bis` I mark it, so I can return to the stream start after the file type is determined. Based on the type, I'm going to configure the `ItemReader` appropriately. Please provide the example you mentioned, I would appreciate it very much. – peremeykin Jul 08 '21 at 10:43
  • How many distinct file types are you expecting to receive in the zip? You should have a domain class for each one right? Those types should be prepared upfront I guess. My question is not about the implementation detail of how to detect the file type, but rather about how are you going to process each file (ie the step definition that you are trying to create dynamically). You should at least know what to do with each file type beforehand. Please share an example with two types of different steps that you are trying to create dynamically to understand what you are trying to achieve. – Mahmoud Ben Hassine Jul 08 '21 at 11:56
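The mark/reset type-sniffing idea from the comments above can be sketched as follows; `CsvTypeSniffer`, `readHeader`, and `PEEK_LIMIT` are illustrative names, not part of the question:

```java
import java.io.BufferedInputStream;
import java.io.IOException;

// Peek at the header line of a csv stream to decide its type,
// then reset so the actual ItemReader sees the stream from the start.
public class CsvTypeSniffer {

    // Must be large enough to cover the header and first data line.
    static final int PEEK_LIMIT = 8192;

    public static String readHeader(BufferedInputStream bis) throws IOException {
        bis.mark(PEEK_LIMIT);
        StringBuilder header = new StringBuilder();
        int c;
        while ((c = bis.read()) != -1 && c != '\n') {
            header.append((char) c);
        }
        bis.reset(); // rewind: the reader will re-read from the beginning
        return header.toString().trim();
    }
}
```

Note that `reset()` only works as long as no more than `PEEK_LIMIT` bytes have been read since `mark()`, so the buffer size genuinely has to cover the sniffed prefix.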

1 Answer


You could use a partitioned step:

  1. During your inspect step, pass a variable containing the list of csv resources to the JobExecutionContext.
  2. In the partition method, retrieve the list of csv files and create a partition for each one.
  3. The worker step will then be executed once for each partition created.
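A minimal sketch of step 2, assuming the inspect step stored the discovered entry names in the job ExecutionContext under a (hypothetical) key `csvEntries`, and that a step-scoped reader later picks up the `csvEntry` value from its step ExecutionContext:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// One partition per discovered csv entry; each partition's context
// carries the zip entry name for the worker step's reader.
public class CsvEntryPartitioner implements Partitioner {

    private final List<String> csvEntries; // e.g. injected from the job ExecutionContext

    public CsvEntryPartitioner(List<String> csvEntries) {
        this.csvEntries = csvEntries;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int i = 0;
        for (String entry : csvEntries) {
            ExecutionContext context = new ExecutionContext();
            context.putString("csvEntry", entry); // read back by a step-scoped reader
            partitions.put("partition" + i++, context);
        }
        return partitions;
    }
}
```

Because each partition runs as its own worker step execution, a failure while loading one file does not prevent the other partitions from being attempted, which matches the "filewise" encapsulation asked for in the question.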
ACH
  • Sounds good, but at first sight it seems that all partitions are supposed to be executed simultaneously, while I do not have a large enough database connection quota to upload all files at the same time. Is there any way to process partitions sequentially? – peremeykin Jul 08 '21 at 09:20
  • Yes, you can either use a SyncTaskExecutor for your partition handler, which is sequential, or a SimpleAsyncTaskExecutor with concurrencyLimit set to 1, which basically makes it sequential (with the option to set it higher when desired). A third option is to use a ThreadPoolTaskExecutor with maxPoolSize set to 1 – ACH Jul 08 '21 at 12:22