I have a number of CSVs, and I need to add a new column to each of them with values obtained by calling Google's Perspective API on the text they contain. What I have is a function,
process_perspective(slice1, slice2, api)
that takes as input a subset of the CSVs in my directory (the subset is selected via the slice1 and slice2 arguments) and, for each CSV in turn, queries the API, adds a new column with the values obtained, and saves the resulting CSV to another directory.
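For reference, process_perspective looks roughly like this (a simplified sketch, not my real code: the directory names, the text and toxicity column names, and the api.score call are all placeholders):

import glob
import os

import pandas as pd

def process_perspective(slice1, slice2, api):
    # Select the subset of CSVs via the two slice indices
    files = sorted(glob.glob("input_csvs/*.csv"))[slice1:slice2]
    for path in files:
        df = pd.read_csv(path)
        # One Perspective API call per row of text
        df["toxicity"] = [api.score(text) for text in df["text"]]
        # Save the augmented CSV to a separate directory
        df.to_csv(os.path.join("output_csvs", os.path.basename(path)), index=False)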
To speed up the process, I came up with this other piece of code:
from concurrent.futures import ThreadPoolExecutor

# (start, stop, API key) for each worker: which CSVs it handles and which key it uses
slices = (
    (0, 100, api1),
    (100, 200, api1),
    (200, 300, api2),
    (300, 400, api2),
)

def runner():
    with ThreadPoolExecutor(max_workers=4) as executor:
        for slice1, slice2, api in slices:
            executor.submit(process_perspective, slice1, slice2, api)
which runs the function on different threads. This does speed up the process a bit, but since I'm also modifying and writing the CSVs, not just making API calls, I was wondering whether I could benefit from using multiprocessing instead of the multithreading I've been using.
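If I understand correctly, switching would mostly be a matter of swapping the executor, something like this (just a sketch, assuming slices and process_perspective are defined at module level so the submitted arguments can be pickled; the __main__ guard is needed on platforms that spawn worker processes):

from concurrent.futures import ProcessPoolExecutor

def runner_mp():
    # Same structure as runner(), but each slice runs in its own process
    with ProcessPoolExecutor(max_workers=4) as executor:
        futures = [
            executor.submit(process_perspective, slice1, slice2, api)
            for slice1, slice2, api in slices
        ]
        # .result() re-raises any worker exception, so failures aren't silently dropped
        for future in futures:
            future.result()

if __name__ == "__main__":
    runner_mp()

One thing I'm unsure about is whether my api objects can be pickled at all, which passing them to a ProcessPoolExecutor would require.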
Another possible solution I had in mind was to split each CSV into chunks, let each thread make the API calls for one chunk, assemble the results at the end, and save the new CSV, but I don't know whether that makes more sense. What do you think?
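Per CSV, what I'm imagining is roughly this (again just a sketch, reusing the placeholder names from above; n_chunks is arbitrary):

import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def score_chunk(chunk, api):
    # Placeholder per-row API calls on one chunk of a single CSV
    chunk = chunk.copy()
    chunk["toxicity"] = [api.score(text) for text in chunk["text"]]
    return chunk

def process_one_csv(path, out_path, api, n_chunks=4):
    df = pd.read_csv(path)
    # Split the rows into n_chunks pieces and score them in parallel
    chunks = np.array_split(df, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as executor:
        results = executor.map(lambda c: score_chunk(c, api), chunks)
    # executor.map preserves order, so the reassembled CSV keeps its original row order
    pd.concat(results).to_csv(out_path, index=False)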
Thank you in advance to everyone and have a good day!