
If some of you could help me design my Spring Batch process, it would be nice :D I need to perform ETL by consuming a REST API and then storing some of the data from it. This process must run daily, and Spring Batch seems perfect for what I want, since we already use the Spring framework for a lot of things at the company I work at. But I am struggling with how to design my job(s?)/tasklets, etc.

Could you please help me design the most appropriate way to do what I want?

Summary of what I need to do:

  1. Consume a summary list of all Items
  2. Loop over those items to retrieve an HREF field
  3. Query each HREF
  4. Insert into the DB (only the data I need; 90% of the data is useless to me)

I am wondering how I should translate those steps into the Spring Batch way. Should I create one tasklet plus one chunk-oriented job: the tasklet queries the main list and writes the HREFs to a local file, then the job reads from the local file and writes to the DB? (It's only about 10k items, so a local file would be fine.) Or should I create a single step whose reader queries both the summary list and each individual endpoint? Which one would be more performant? I don't need to maximize performance; I'm quite new to Spring Batch and wondering how to design the processing :)
Thanks!!


EDIT: I cannot use a simple list because the list is not at root level but inside a "data" property at root level. Also, by "Query each HREF" I meant perform an API call using the HREF value, which is a link to the endpoint for a single item's data. I must query it because I need data from it that is not present in the first list returned by the API.
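For illustration only (every property and field name other than `data` is hypothetical), the payload shape described here would look something like:

```json
{
  "meta": { "count": 10000 },
  "data": [
    { "id": 1, "name": "item-1", "href": "https://api.example.com/items/1" },
    { "id": 2, "name": "item-2", "href": "https://api.example.com/items/2" }
  ]
}
```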


EDIT 2 : See comments on accepted answer for solution.

1 Answer


How to design my process - loop over a list + 1 query for each item - spring batch

You can create a chunk-oriented step as follows:

  • An item reader that returns items from the list (ListItemReader might work)
  • An item processor that enriches each item using its HREF field
  • A JdbcBatchItemWriter to insert items in the DB

This is a common pattern, and it is documented here: Driving Query Based ItemReaders. That said, this pattern works well with small/medium data sets, but not with large data sets, as it requires one or more queries for each item.
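A minimal configuration sketch of such a step, using the Spring Batch 4 builder API. Everything not named in the answer is an assumption: the classes `ItemSummary`/`ItemDetail`, the `fetchSummaries()` helper, the table and columns in the SQL, and the use of `RestTemplate` are placeholders, not a tested implementation.

```java
// Configuration sketch only — ItemSummary, ItemDetail, fetchSummaries() and
// the SQL statement are hypothetical placeholders.
@Configuration
public class ItemSyncJobConfig {

    @Bean
    public Step itemSyncStep(StepBuilderFactory stepBuilderFactory,
                             DataSource dataSource,
                             RestTemplate restTemplate) {
        // Reader: the summary list is small (~10k items), so it can be
        // fetched once and served from memory with a ListItemReader.
        ListItemReader<ItemSummary> reader =
                new ListItemReader<>(fetchSummaries(restTemplate));

        // Processor: one REST call per item, following its HREF to the
        // detail endpoint (the "driving query" pattern from the answer).
        ItemProcessor<ItemSummary, ItemDetail> processor =
                summary -> restTemplate.getForObject(summary.getHref(), ItemDetail.class);

        // Writer: batched JDBC inserts of only the columns you need.
        JdbcBatchItemWriter<ItemDetail> writer =
                new JdbcBatchItemWriterBuilder<ItemDetail>()
                        .dataSource(dataSource)
                        .sql("INSERT INTO item (id, name) VALUES (:id, :name)")
                        .beanMapped()
                        .build();

        return stepBuilderFactory.get("itemSyncStep")
                .<ItemSummary, ItemDetail>chunk(100)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}
```

The chunk size of 100 is an arbitrary starting point; with ~10k items it mainly controls how many rows go into each transaction.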

Mahmoud Ben Hassine
  • Thanks for your answer! I forgot to mention something: the "issue" is that the list is not at root level. The JSON root contains two properties, "meta", which contains metadata, and "data", which contains the actual list, so I cannot read it as a simple list of items :(. Also, I think you misunderstood my (bad) explanation. The item list contains an `HREF`, which is a URL to another endpoint of the API that I need to call to retrieve the item composition, data not present in the list of all items. Then I need to write this composition to my DB. – Paddy Mariage Aug 02 '22 at 14:14
  • Using a simple preparatory tasklet step, you can download the JSON payload to a file and read the data from disk after preparing it in the format you need. Once the chunk-oriented step has processed the data, the file can be removed. This might seem less efficient than reading data from memory, but it is actually better in terms of fault tolerance: if your job fails, the file can be (re)used to restart the failed step where it left off, without requiring another REST call to fetch the data again on restart. – Mahmoud Ben Hassine Aug 03 '22 at 06:19
  • WRT `HREF`, the item processor can inspect the item to get the URL and do the REST call to grab any additional data. This is a typical use case for an item processor. HTH. – Mahmoud Ben Hassine Aug 03 '22 at 06:19
  • Since I don't need high performance and the scope is quite small (10k items), writing the file to disk is fine. Your answers did indeed help me; thank you a lot, and I will mark your answer as the solution to my question. Have a good day! – Paddy Mariage Aug 03 '22 at 06:57
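The preparatory tasklet described in the comments could be sketched as follows. The endpoint URL, the file path, and the use of Jackson to unwrap the "data" property are all assumptions for illustration, not part of the accepted answer.

```java
// Sketch of the preparatory tasklet: download the JSON payload once, keep
// only the "data" array, and write it to a local file so the chunk-oriented
// step can restart from disk without re-calling the API.
// URL, file name, and Jackson usage are hypothetical.
public class DownloadItemsTasklet implements Tasklet {

    private final RestTemplate restTemplate = new RestTemplate();
    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext)
            throws Exception {
        String body = restTemplate.getForObject("https://api.example.com/items", String.class);
        // The list is not at root level: extract the "data" property only.
        JsonNode data = mapper.readTree(body).get("data");
        mapper.writeValue(new File("items.json"), data);
        return RepeatStatus.FINISHED; // one-shot tasklet: run once, then complete
    }
}
```

A later step can then read `items.json` (e.g. with a `JsonItemReader`) and delete the file once the job completes successfully.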