I have 60,000 .txt files of 15-20 KB each (total data is 30 GB). I want to apply some data extraction logic to each file and store the results in a database.
I tried running the script sequentially, but it takes a lot of time. I think I/O is the bottleneck, so I am exploring Python multiprocessing/multithreading and libraries like Dask to do the job.
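This is roughly what I'm experimenting with (a minimal sketch: the `data/` directory, `extract`, and `save_to_db` are placeholders for my actual paths, extraction logic, and database insert):

```python
from multiprocessing import Pool
from pathlib import Path

def extract(text):
    # placeholder for my actual extraction logic
    return {"length": len(text)}

def save_to_db(record):
    # placeholder for my actual database insert
    pass

def process_file(path):
    # each file is only 15-20 KB, so reading it whole is cheap
    text = Path(path).read_text()
    save_to_db(extract(text))

if __name__ == "__main__":
    files = list(Path("data").glob("*.txt"))
    with Pool() as pool:  # one worker process per CPU core by default
        for _ in pool.imap_unordered(process_file, files, chunksize=100):
            pass
```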
Is there any better approach for this use case?
I also tried implementing a producer-consumer approach: one thread loads the data into memory, and a worker consumes the data, applies the extraction logic, and writes to the database.
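Here is a simplified sketch of that producer-consumer setup (again, `extract` and `save_to_db` stand in for my real logic, and the `data/` directory is just an example path):

```python
import queue
import threading
from pathlib import Path

def extract(text):
    # placeholder for my actual extraction logic
    return {"length": len(text)}

def save_to_db(record):
    # placeholder for my actual database insert
    pass

def producer(paths, q):
    # reader thread: push each file's contents onto the queue
    for path in paths:
        q.put(Path(path).read_text())
    q.put(None)  # sentinel so the consumer knows the input is finished

def consumer(q):
    # worker: pull text off the queue, extract, write to the database
    while True:
        text = q.get()
        if text is None:
            break
        save_to_db(extract(text))

if __name__ == "__main__":
    paths = sorted(Path("data").glob("*.txt"))
    q = queue.Queue(maxsize=1000)  # bounded so memory use stays flat
    t = threading.Thread(target=producer, args=(paths, q))
    t.start()
    consumer(q)
    t.join()
```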
Am I reinventing the wheel? Is there a Python library that already does this?