
I have the following snippet, which iterates over a list of .csv files and then uses an insert_csv_data function that reads, preprocesses and inserts each .csv file's data into a .hyper file (Hyper is Tableau's in-memory data engine technology, designed for fast data ingest and analytical query processing on large or complex data sets):

A detailed interpretation of the insert_csv_data function can be found here

for csv in csv_list:
    insert_csv_data(hyper)

The issue with the above code is that it inserts the .csv files into the .hyper file one at a time, which is quite slow at the moment.

I would like to know if there's a faster or parallel workaround as I'm using Apache Spark for processing on Databricks. I've done some research and found modules like multiprocessing, joblib and asyncio that might work for my scenario, but I'm unsure of how to correctly implement them.
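For context, the shape such a parallel workaround would take with the standard library's concurrent.futures is sketched below. Since insert_csv_data isn't shown in the question, the body here is a hypothetical stand-in that just returns a message; csv_list is likewise a made-up list of paths:

```python
from concurrent.futures import ThreadPoolExecutor


def insert_csv_data(csv_path):
    # Placeholder: the real function would read, preprocess and
    # insert the .csv file's data into the .hyper file.
    return f"inserted {csv_path}"


csv_list = ["a.csv", "b.csv", "c.csv"]

# executor.map applies the function to every path, running up to
# max_workers calls concurrently, and yields results in input order.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(insert_csv_data, csv_list))
```

Whether threads actually help depends on whether the real insert is I/O-bound and on whether the target (here, Hyper) tolerates concurrent connections.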

Please advise.

Edit:

Parallel Code:

from joblib import Parallel, delayed
element_run = Parallel(n_jobs=1)(delayed(insert_csv_data)(csv) for csv in csv_list)
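Note that n_jobs=1 in the line above makes joblib run the tasks serially. A minimal sketch of a genuinely parallel variant is below; since the real insert_csv_data isn't shown, the one here is a hypothetical stand-in that just returns the path length:

```python
from joblib import Parallel, delayed


def insert_csv_data(csv_path):
    # Placeholder standing in for the real Hyper insert.
    return len(csv_path)


csv_list = ["a.csv", "bb.csv"]

# n_jobs=2 runs two tasks at once; prefer="threads" asks joblib for a
# thread-based backend, which suits I/O-bound work and avoids pickling
# the function for worker processes.
results = Parallel(n_jobs=2, prefer="threads")(
    delayed(insert_csv_data)(csv) for csv in csv_list
)
```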

1 Answer


This does not directly answer the question, but it demonstrates how multiprocessing and multithreading are easily interchangeable using the concurrent.futures module. Note that the two loops achieve exactly the same thing and that the only difference between the two sections of code is the worker pool class.

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


def tfunc(n):
    return n * n


N = 1_000


def main():
    # Thread-based pool: all workers run in this process.
    with ThreadPoolExecutor() as executor:
        for future in [executor.submit(tfunc, n) for n in range(N)]:
            future.result()

    # Process-based pool: identical code, but workers are separate processes.
    with ProcessPoolExecutor() as executor:
        for future in [executor.submit(tfunc, n) for n in range(N)]:
            future.result()


if __name__ == '__main__':
    main()
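As an aside, the submit-then-result pattern above can also be written with executor.map, which handles submission and ordered collection in one call. A minimal sketch using the same tfunc:

```python
from concurrent.futures import ThreadPoolExecutor


def tfunc(n):
    return n * n


# executor.map yields results in the order of the inputs.
with ThreadPoolExecutor() as executor:
    results = list(executor.map(tfunc, range(5)))
```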
  • No because the two lines of code you've presented make no sense. What is *hyper*? You don't seem to be using *csv*. What does *insert_csv_data* do? Crucial questions that need to be answered –  Oct 11 '21 at 07:17
  • In that case I suggest you try both techniques and measure the outcome. –  Oct 11 '21 at 07:26
  • Thank you. Here's a more appropriate [reference](https://github.com/tableau/hyper-api-samples/blob/main/Tableau-Supported/Python/create_hyper_file_from_csv.py) if you have the time. – The Singularity Oct 11 '21 at 07:27
  • Given that Hyper seems to be an in-memory technology, I would start with multiprocessing –  Oct 11 '21 at 07:30
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/238012/discussion-between-brutusforcus-and-luke). –  Oct 11 '21 at 07:31
  • I strongly recommend that you read https://joblib.readthedocs.io/en/latest/parallel.html and also just go ahead and adapt my code for your purposes –  Oct 11 '21 at 07:38
  • `ProcessPoolExecutor` took longer than the usual time, and `ThreadPoolExecutor` didn't work as `Hyper` doesn't support making multiple connections at the same time. Thanks a lot for the assistance, I'll take up the issue with Tableau – The Singularity Oct 11 '21 at 09:09