I'm new to Python (2.7). I have written a simple program that uses tabula-py and PyPDF2 to read the tables in about 10,000 500-page PDFs and write a .csv for each. Eventually I plan to write the extracted data to an SQL database, but for now just .csvs. I have access to multiple cores via my university's remote environment, and I would like to speed up the program by running it in parallel. The processes can run completely independently: each one works on a different PDF and writes to its own .csv, and the order of processing does not matter.
I have read a bit about multithreading and multiprocessing, but I don't think my problem requires anything sophisticated, since the same resources never need to be accessed simultaneously.
What is the best approach to employing, say, 8 CPU cores to speed up this task?
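To make the question concrete, here is a minimal sketch of the structure I have in mind, using `multiprocessing.Pool`. The `process_pdf` worker is a placeholder: in the real program it would call `tabula.read_pdf` and write the result out, but here it just computes the .csv name it would produce, so the parallel scaffolding can be shown on its own. The file names are made up.

```python
import multiprocessing
import os

def process_pdf(pdf_path):
    """Extract tables from one PDF and write one CSV.

    Placeholder worker: the real version would do something like
        df = tabula.read_pdf(pdf_path, pages="all")
        df.to_csv(csv_path)
    Here we only compute and return the output path, so the
    pool mechanics can be demonstrated without Java/tabula.
    """
    csv_path = os.path.splitext(pdf_path)[0] + ".csv"
    return csv_path

if __name__ == "__main__":
    # Hypothetical input list; in practice this would come from
    # os.listdir() or glob on the directory of 10,000 PDFs.
    pdf_files = ["report1.pdf", "report2.pdf", "report3.pdf"]

    # 8 worker processes, each handling one PDF at a time;
    # map() distributes the files across the workers.
    pool = multiprocessing.Pool(processes=8)
    results = pool.map(process_pdf, pdf_files)
    pool.close()
    pool.join()
    print(results)
```

Is this roughly the right approach, or is there a better pattern for a job like this where the tasks share nothing?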