
I am looking for design advice on the use case below. I am designing an application that can process Excel/CSV/JSON files. They all contain the same set of about 72 columns/attributes, and each file may contain up to 1 million records.

I have two options for processing these files.

Option 1

  1. Service 1: Read the content from the given file, convert each row into JSON, and save the records into a SQL table using batch processing (3,000 records per batch).
  2. Service 2: Fetch those JSON records from the database table (saved in step 1), process them (validation and calculation), and save the final results into a separate table.
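Service 1 of Option 1 could be sketched roughly like this. This is a minimal illustration, not my actual code; the table name `dbo.StagingRecords`, its columns, and the `rows` input are hypothetical placeholders, and it assumes the `Microsoft.Data.SqlClient` package:

```csharp
using System.Collections.Generic;
using System.Data;
using System.Text.Json;
using Microsoft.Data.SqlClient;

// Sketch of Service 1: serialize each parsed row to JSON and bulk-insert
// into a staging table in batches. All names here are placeholders.
public static class StagingLoader
{
    public static void Load(string connectionString, string fileName,
                            IEnumerable<Dictionary<string, object>> rows)
    {
        var table = new DataTable();
        table.Columns.Add("FileName", typeof(string));
        table.Columns.Add("RowJson", typeof(string));

        using var bulk = new SqlBulkCopy(connectionString)
        {
            DestinationTableName = "dbo.StagingRecords",
            BatchSize = 3000                  // matches the 3k-per-batch approach
        };
        bulk.ColumnMappings.Add("FileName", "FileName");
        bulk.ColumnMappings.Add("RowJson", "RowJson");

        foreach (var row in rows)
        {
            table.Rows.Add(fileName, JsonSerializer.Serialize(row));
            if (table.Rows.Count == 3000)
            {
                bulk.WriteToServer(table);    // one round trip per batch
                table.Clear();
            }
        }
        if (table.Rows.Count > 0)
            bulk.WriteToServer(table);        // flush the final partial batch
    }
}
```

`SqlBulkCopy` is usually much faster than issuing individual `INSERT` statements, so it may reduce the 5 minutes per 100K records reported below.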

Option 2 (using RabbitMQ)

  1. Service 1: Read the content from the given file and publish every row as a message to a queue. If a file contains 1 million records, this service sends 1 million messages to the queue.
  2. Service 2: Listen to the queue created in step 1, process those messages (validation and calculation), and save the final results into a separate table.
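The publisher side of Option 2 could look roughly like this, assuming the RabbitMQ.Client 6.x package; the queue name `file-rows` is a placeholder. Publishing in batches rather than one `BasicPublish` call per row cuts per-message overhead:

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using RabbitMQ.Client;

// Sketch of Service 1 as a publisher: send each JSON row to a durable
// queue, flushing in batches of 1,000. Queue name is a placeholder.
public static class RowPublisher
{
    public static void Publish(IEnumerable<string> jsonRows)
    {
        var factory = new ConnectionFactory { HostName = "localhost" };
        using var connection = factory.CreateConnection();
        using var channel = connection.CreateModel();

        channel.QueueDeclare("file-rows", durable: true, exclusive: false,
                             autoDelete: false, arguments: null);

        var props = channel.CreateBasicProperties();
        props.Persistent = true;              // survive a broker restart

        var batch = channel.CreateBasicPublishBatch();
        int pending = 0;
        foreach (var json in jsonRows)
        {
            batch.Add("", "file-rows", false, props,
                      new ReadOnlyMemory<byte>(Encoding.UTF8.GetBytes(json)));
            if (++pending == 1000)
            {
                batch.Publish();              // flush 1,000 messages at once
                batch = channel.CreateBasicPublishBatch();
                pending = 0;
            }
        }
        if (pending > 0) batch.Publish();     // flush the remainder
    }
}
```

Another common variant is to put a chunk of rows (say 500) into a single message, which reduces the message count from 1 million to a few thousand.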

POC experience with Option 1:

  1. It took 5 minutes to read and batch-save the data into the table for a 100K-record file (the job of Service 1).
  2. When the application processes multiple files in parallel, each containing 200K records, I sometimes see deadlocks.
    1. No indexes or relationships are created on this batch-processing table.
    2. I save 3,000 records per batch to avoid table locks.
  3. While the services are processing, results are trackable and progress can be queried. For example, for "File 1.JSON": 50,000 records processed successfully, remaining 1,000 in progress.
  4. If Service 1 finishes its job correctly and something then goes wrong in Service 2, we still have good control to reprocess those records, because they are persisted in the database.
  5. I plan to delete the data from the batch-processing table with a nightly SQL job, once all records have been processed by Service 2, so the table is fresh and ready to store the next day's data.
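The nightly cleanup in point 5 can itself be done in small batches so that the delete never holds a long table lock. A minimal sketch, assuming a hypothetical `dbo.StagingRecords` table with a `Status` column and the `Microsoft.Data.SqlClient` package:

```csharp
using Microsoft.Data.SqlClient;

// Sketch of the nightly cleanup job: delete processed staging rows in
// batches of 3,000 until none remain. Table/column names are placeholders.
public static class StagingCleanup
{
    public static void Run(string connectionString)
    {
        using var conn = new SqlConnection(connectionString);
        conn.Open();
        int deleted;
        do
        {
            using var cmd = new SqlCommand(
                @"DELETE TOP (3000) FROM dbo.StagingRecords
                  WHERE Status = 'Processed';", conn);
            deleted = cmd.ExecuteNonQuery();  // rows removed in this batch
        } while (deleted > 0);                // stop when nothing is left
    }
}
```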

POC experience with Option 2:

  1. Producing (Service 1) and consuming (Service 2) the messages for a 100K-record file took around 2 hours 30 minutes.
  2. No file data is stored in the database, so there are no deadlocks (unlike Option 1).
  3. While the services are processing the records, results are not as trackable as in Option 1, which matters for sharing status with the clients who sent the file. We can see the status of messages on the RabbitMQ management screen for monitoring purposes.
  4. If Service 1 partially reads the data from a given file and then errors out, there is, as far as I know, no way to roll back the messages already published to RabbitMQ, so the consumer keeps working on those published messages.
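On the consumer side, manual acknowledgements at least ensure a message leaves the queue only after it has been validated, calculated, and saved, so a crashed or failed Service 2 instance does not lose rows. A sketch assuming RabbitMQ.Client 6.x; the queue name and `processRow` callback are placeholders:

```csharp
using System;
using System.Text;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

// Sketch of Service 2: consume with manual acks and a bounded prefetch.
// A failed message is requeued (or could go to a dead-letter queue).
public static class RowConsumer
{
    public static void Start(Action<string> processRow)
    {
        var factory = new ConnectionFactory { HostName = "localhost" };
        var connection = factory.CreateConnection();
        var channel = connection.CreateModel();

        channel.QueueDeclare("file-rows", durable: true, exclusive: false,
                             autoDelete: false, arguments: null);
        channel.BasicQos(0, prefetchCount: 100, global: false);

        var consumer = new EventingBasicConsumer(channel);
        consumer.Received += (sender, ea) =>
        {
            var json = Encoding.UTF8.GetString(ea.Body.ToArray());
            try
            {
                processRow(json);             // validate + calculate + save
                channel.BasicAck(ea.DeliveryTag, multiple: false);
            }
            catch
            {
                // put the message back so another consumer can retry it
                channel.BasicNack(ea.DeliveryTag, multiple: false, requeue: true);
            }
        };
        channel.BasicConsume("file-rows", autoAck: false, consumer);
    }
}
```

Note this does not solve the partial-file problem in point 4; for that, the processing would also need to be idempotent (e.g. keyed on file name plus row number) so re-published rows are not double-counted.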

I can horizontally scale the application in both of these options to speed up processing.

Given the facts above, both options have advantages and disadvantages. Is this a good use case for RabbitMQ? Is it advisable to produce and consume millions of records through RabbitMQ? Is there a better way to handle this use case apart from these two options? Please advise.

*** I am using .NET Core 5.0 and SQL Server 2019. Service 1 and Service 2 are .NET Core worker services (Windows jobs). All tests were done on my local machine, with RabbitMQ installed in Docker (also on my local machine).

user1551655
  • With loading the data quickly: https://stackoverflow.com/q/46373895/495455. Whether to use a queue or batching depends: you can use a queue to handle excessive writes and batching to speed up bulk inserts, so I'd say a combination of both if you can. – Jeremy Thompson Jul 09 '21 at 04:29
  • Jeremy - do you see any design concern with processing CSV file data through RabbitMQ? Is it advisable to process multiple files through queues? I have no experience with RabbitMQ beyond completing the POC. Please share if there are side effects. – user1551655 Jul 13 '21 at 02:23

0 Answers