
I have 60,000 .txt files, each 15-20 KB (30 GB of data in total). I want to apply some data extraction logic to each file and store the results in a database.

I tried running the script sequentially, but it takes a lot of time. I think I/O is the bottleneck, so I am exploring Python multiprocessing/multithreading and libraries like Dask to do the job.
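
For reference, this is roughly the parallel version I am experimenting with using the standard library's `multiprocessing.Pool` (a minimal sketch only; `extract`, `save_to_db`, and the `data/` directory are placeholders for my real logic and layout):

```python
from multiprocessing import Pool
from pathlib import Path

def extract(text: str) -> dict:
    """Placeholder for my actual extraction logic."""
    return {"length": len(text)}

def save_to_db(record: dict) -> None:
    """Placeholder for the database write."""
    pass

def process_file(path: Path) -> None:
    text = path.read_text(encoding="utf-8", errors="ignore")
    save_to_db(extract(text))

if __name__ == "__main__":
    files = list(Path("data").glob("*.txt"))    # ~60,000 files
    with Pool() as pool:                        # defaults to one process per core
        pool.map(process_file, files, chunksize=100)
```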

Is there any better approach for this use case?

I tried implementing a producer/consumer approach: one thread loads data into memory, and the workers consume it, applying the extraction logic and writing to the database.
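
Here is a stripped-down sketch of that producer/consumer setup (again, `extract`/`save_to_db` are placeholders, and the queue size and worker count are guesses I would tune):

```python
import queue
import threading
from pathlib import Path

def extract(text: str) -> dict:            # placeholder extraction logic
    return {"length": len(text)}

def save_to_db(record: dict) -> None:      # placeholder DB write
    pass

NUM_WORKERS = 4                            # guess -- tune for CPU vs. DB latency
chunks = queue.Queue(maxsize=100)          # bounded so memory use stays small

def producer(paths):
    """Single reader thread: load each file into memory and enqueue its text."""
    for path in paths:
        chunks.put(path.read_text(encoding="utf-8", errors="ignore"))
    for _ in range(NUM_WORKERS):
        chunks.put(None)                   # one sentinel per worker to signal "done"

def consumer():
    """Worker: apply the extraction logic and write the result to the database."""
    while True:
        text = chunks.get()
        if text is None:
            break
        save_to_db(extract(text))

paths = list(Path("data").glob("*.txt"))   # "data" is a placeholder directory
workers = [threading.Thread(target=consumer) for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()
reader = threading.Thread(target=producer, args=(paths,))
reader.start()
reader.join()
for w in workers:
    w.join()
```

The bounded queue keeps the memory footprint small; what I am unsure about is whether threads are enough for the extraction step or whether the consumers should be separate processes.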

Am I reinventing the wheel? Is there a Python library that already does this?

  • Does this answer your question? [Dask: How would I parallelize my code with dask delayed?](https://stackoverflow.com/questions/42550529/dask-how-would-i-parallelize-my-code-with-dask-delayed) – SultanOrazbayev Sep 22 '21 at 14:21
  • @SultanOrazbayev thanks for taking a look, yes this is one of the approaches I am trying to implement, but is there a better approach than this? – Samarth Singh Thakur Sep 22 '21 at 16:31
  • Can you provide an MCVE that would help reproduce and diagnose your problem? https://stackoverflow.com/help/minimal-reproducible-example – rrpelgrim Oct 06 '21 at 09:46
