
I have a number of CSVs, and I need to add a new column to each of them, with values obtained by calling Google's Perspective API on the text they contain. What I have is a function,

process_perspective(slice1, slice2, api)

that takes as input a subset of the CSVs in my directory (the subset is selected via the slice1 and slice2 arguments) and, for each CSV in turn, queries the API, adds a new column with the values obtained, and saves the resulting CSV to another directory.
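
To make this concrete, here is roughly the shape it has (a minimal sketch, not my exact code: the directory names, the text column name, and the get_toxicity helper are placeholders, and the endpoint and response parsing follow the Perspective API docs):

from pathlib import Path

import pandas as pd
import requests

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"
IN_DIR = Path("csvs_in")    # placeholder input directory
OUT_DIR = Path("csvs_out")  # placeholder output directory

def get_toxicity(text, api_key):
    # One Perspective call per row of text; returns the TOXICITY summary score.
    body = {"comment": {"text": text}, "requestedAttributes": {"TOXICITY": {}}}
    resp = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=body)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

def process_perspective(slice1, slice2, api):
    # Score every CSV in files[slice1:slice2] and save a copy with the new column.
    for path in sorted(IN_DIR.glob("*.csv"))[slice1:slice2]:
        df = pd.read_csv(path)
        df["toxicity"] = [get_toxicity(t, api) for t in df["text"]]
        df.to_csv(OUT_DIR / path.name, index=False)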

To speed up the process, I came up with this other piece of code:

from concurrent.futures import ThreadPoolExecutor

# (start, stop, API key) triples: each worker handles one slice of the CSV list
slices = (
    (0, 100, api1),
    (100, 200, api1),
    (200, 300, api2),
    (300, 400, api2),
)

def runner():
    with ThreadPoolExecutor(max_workers=4) as executor:
        for (slice1, slice2, api) in slices:
            executor.submit(process_perspective, slice1, slice2, api)

which runs the previous function on different threads. This does speed up the process a bit, but since I'm also modifying and writing the CSVs, not just making the API calls, I was wondering whether I could benefit from using multiprocessing instead of the multithreading I've been doing.
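
For reference, the multiprocessing variant I'd be testing is just the drop-in swap (an untested sketch; note the __main__ guard that process pools require on Windows and macOS):

from concurrent.futures import ProcessPoolExecutor

def runner_mp():
    # Same fan-out as runner(), but each slice gets its own process,
    # so the pandas work is no longer serialized by the GIL.
    with ProcessPoolExecutor(max_workers=4) as executor:
        for (slice1, slice2, api) in slices:
            executor.submit(process_perspective, slice1, slice2, api)

if __name__ == "__main__":
    runner_mp()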

Another possible solution I had in mind was to split each CSV into chunks, let each thread make the calls for one chunk, assemble the results at the end, and save the new CSV, but I don't know if that makes more sense or not: what do you think?
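
A sketch of that chunked idea, reusing the hypothetical get_toxicity helper from above (each thread scores one block of rows from a single CSV, and the results are reassembled in order):

import itertools
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

def score_chunk(chunk, api):
    # One thread scores one block of rows (purely API-bound work).
    return [get_toxicity(t, api) for t in chunk["text"]]

def process_one_csv(path, apis, n_chunks=4):
    df = pd.read_csv(path)
    chunks = np.array_split(df, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as executor:
        # map() yields results in chunk order, so the scores line up with the rows.
        scores = executor.map(score_chunk, chunks, itertools.cycle(apis))
    df["toxicity"] = [s for chunk_scores in scores for s in chunk_scores]
    df.to_csv(OUT_DIR / path.name, index=False)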

Thank you in advance to everyone and have a good day!

norberto
  • Try it out and compare; this is the best way to answer your question. My instinct says that performance will be better using processes rather than threads. My experience says that the ideal structure is to separate the task into two subtasks: API calls with threads, CSV manipulation with processes. – Michael Ruth May 02 '23 at 15:54
  • @MichaelRuth thank you Michael, I'll try separating the task the way you suggested and I'll report back – norberto May 02 '23 at 15:56
  • Try just using processes first, it's really easy and may get you the performance you desire for very little effort. Just replace `ThreadPoolExecutor(max_workers=4)` with `ProcessPoolExecutor(max_workers=os.cpu_count()-1)`. Also take a look at the answers and discussion regarding https://stackoverflow.com/q/27455155/4583620. – Michael Ruth May 02 '23 at 16:00
  • It really depends on whether `process_perspective()` is IO-bound (disk reads/writes & API calls) or CPU-bound. You can also get more precise control of resource utilization by breaking up `process_perspective()` to pool together similar operations across all the calls, e.g. all the disk reads, all the API calls, all the computation, and all the disk writes. – Kache May 02 '23 at 16:34
  • @MichaelRuth thank you again, by using multiprocessing I roughly doubled my performance; I'll try to see if I can optimize it even more by separating the task in two. – norberto May 02 '23 at 16:48
  • @Kache I would have thought it was IO-bound, but since I get better performance by using multiprocessing I'm not so sure about it anymore. Also thank you for the input, I'll see if I can optimize it even more – norberto May 02 '23 at 16:52
  • Multiprocessing is efficient when you have several CPUs; otherwise it's slower than multithreading. You can also have several processes, one per CPU, each one using multiple threads (to perform processing while your CPU waits for the API call answer). As for the CSV files, you can manage access to a single file with a mutex (but that's hard to manage in Python), or prefer separate files and merge them at the end. – guillaume blaquiere May 02 '23 at 18:47
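
A minimal sketch of the thread/process split Michael Ruth describes above, reusing the hypothetical helpers from the question (threads handle the API calls, a process pool handles the pandas and disk work):

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def fetch_scores(path, api):
    # I/O-bound stage: one score per row, fetched concurrently with threads.
    df = pd.read_csv(path)
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda t: get_toxicity(t, api), df["text"]))

def write_scored(path, scores):
    # CPU/disk-bound stage: runs in a worker process; re-reads the CSV so
    # only small arguments cross the process boundary.
    df = pd.read_csv(path)
    df["toxicity"] = scores
    df.to_csv(OUT_DIR / path.name, index=False)

def two_stage(paths, api):
    with ProcessPoolExecutor() as writers:
        for path in paths:
            writers.submit(write_scored, path, fetch_scores(path, api))

if __name__ == "__main__":
    two_stage(sorted(IN_DIR.glob("*.csv")), "my-api-key")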

0 Answers