
I'm new to Python (2.7). I have written a simple program that uses tabula-py and PyPDF2 to read tables from about 10,000 500-page PDFs and write a .csv for each. Eventually I plan to write the extracted data to an SQL database, but for now just .csv files. I have access to multiple cores via my university's remote environment, and I would like to speed up the program by running it in parallel. The parallel processes should be able to run completely independently, each working on a different PDF and writing to a separate .csv. The order of processing does not matter.

I have read a bit about multithreading and multiprocessing, but I think my problem should not require anything sophisticated, because the same resources never need to be accessed simultaneously.

What is the best approach to employing, say, 8 CPU cores to speed up this task?

1 Answer


I am not sure whether you're going to be IO-bound or CPU-bound, but if it turns out to be CPU-bound (which I reckon is the most likely case), threading might not be the option you seek.

When to use threads vs processes?

  • Processes speed up Python operations that are CPU-intensive, because they benefit from multiple cores and avoid the GIL (a minimal sketch follows this list).
  • Threads are best for IO tasks or tasks involving external systems, because threads can combine their work more efficiently. Processes need to pickle their results to combine them, which takes time.
  • Threads provide no benefit in Python for CPU-intensive tasks because of the GIL.
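
Given that parsing 500-page PDFs is almost certainly CPU-bound, the process-based pool is the variant most likely to help you. A minimal sketch of that pattern (the function, the 8-worker count and the toy input are placeholders, not your actual code):

from multiprocessing import Pool  # process-based pool, sidesteps the GIL

def my_function(item):
    # CPU-heavy work goes here; the function must be defined at module top
    # level so that worker processes can import it.
    return item * 2

if __name__ == '__main__':
    my_array = range(100)
    pool = Pool(processes=8)                   # roughly one worker per core
    results = pool.map(my_function, my_array)  # blocks until every item is done
    pool.close()
    pool.join()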

Another helpful kickstart can be found here and here. The upvoted 2015 answer here might also be of assistance; it summarizes examples and explanations from Chris Kiehl. For a quick reference:

from multiprocessing.dummy import Pool as ThreadPool  # dummy wraps the threading module

pool = ThreadPool(4)                       # 4 worker threads
results = pool.map(my_function, my_array)  # apply my_function to every item

Which is the multithreaded version of:

results = []
for item in my_array:
    results.append(my_function(item))
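
Applied to your task, each worker would take one PDF path, run your existing tabula-py/PyPDF2 extraction, and write its own .csv, so the workers never touch the same file. A rough sketch, assuming your extraction can be wrapped in a single function (the directory and the extract_and_write call are placeholders for your own code):

import glob
from multiprocessing import Pool

def convert_pdf(pdf_path):
    csv_path = pdf_path[:-4] + '.csv'
    # Placeholder: call your existing tabula-py / PyPDF2 extraction here,
    # e.g. extract_and_write(pdf_path, csv_path)
    return csv_path

if __name__ == '__main__':
    pdf_files = glob.glob('/path/to/pdfs/*.pdf')  # placeholder directory
    pool = Pool(processes=8)
    # imap_unordered yields results as workers finish; order does not matter here
    for done in pool.imap_unordered(convert_pdf, pdf_files):
        print done                                # Python 2.7 print statement
    pool.close()
    pool.join()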

Good luck!
