
I'm running a for loop that calls a function which returns a pandas Series. On each iteration of the for loop I append that row to a final DataFrame output.

Inside the function I calculate some stuff and query a SQL database.

How can I run this on 4 or 5 parallel threads and still append to the same final DataFrame?

import pandas as pd

df_final = pd.DataFrame()

for i in range(0, 10000):
    series = myFunction(A, B, C)
    df_final = df_final.append(series)
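One way to parallelize a loop like this is the standard-library `concurrent.futures.ThreadPoolExecutor`, collecting the returned Series in a list and building the DataFrame once at the end (instead of appending inside the loop). This is only a sketch: `myFunction` below is a placeholder for the real function that queries SQL, and `A`, `B`, `C` are dummy values.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def myFunction(A, B, C):
    # Placeholder for the real work (calculations + SQL query)
    # that returns one row as a pandas Series.
    return pd.Series({"a": A, "b": B, "c": C})


A, B, C = 1, 2, 3

# Run the calls on 4 worker threads; map preserves the input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    rows = list(pool.map(lambda _: myFunction(A, B, C), range(100)))

# Build the final DataFrame once, from the collected rows.
df_final = pd.DataFrame(rows)
```

Because each thread only appends its result to a local list (and the DataFrame is built afterwards), there is no shared-DataFrame contention to worry about.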
– ortunoa
  • Which database? Which interface/library do you use to access the database? – ypnos Aug 18 '22 at 17:06
  • Don't `append`. `df_final = pd.DataFrame([myFunction(A,B,C) for i in range(10000)])` is much better and faster. – Quang Hoang Aug 18 '22 at 17:09
  • @ypnos I'm using sqlalchemy to interact with a SQL database from which I need to query data for each record. – ortunoa Aug 18 '22 at 17:37
  • @QuangHoang I've tried your way and it doesn't really run any faster, thanks though. – ortunoa Aug 18 '22 at 17:38
  • 1
    1. It should run faster since `for ... : df.append` is O(n^2) because it copies the data every iteration. 2. look into [multithread](https://stackoverflow.com/questions/3033952/threading-pool-similar-to-the-multiprocessing-pool), 3. look into asynchronous requests as well. – Quang Hoang Aug 18 '22 at 17:43
  • @QuangHoang you're probably right that it runs faster, but it still takes more than 5 minutes and I'd need it to take less than 1. Thanks again – ortunoa Aug 18 '22 at 17:45
  • 1
    Doing 10_000 SQL requests is clearly not efficient (because most DB has ACID transactions). Consider doing one big request. Using multiple threads will certainly not help but add more overhead. If all request are different, then consider doing few transactions with many requests per transactions. – Jérôme Richard Aug 18 '22 at 18:34
  • As has been said, your biggest bottleneck is your DB queries. Second will be the df.append. Your best solution will be to get your DB data in memory (pd.read_sql) and perform your queries against the dataframe. If you can't do that, a small improvement is to not convert your sqlalchemy response to a Series but append the returned dict to a list. You can then feed that list to pd.DataFrame. – rayad Aug 20 '22 at 20:28
  • I've changed the code to only do one SQL query and store the results in a dataframe. It still takes a very long time, I think it's the append portion of the code. – ortunoa Aug 22 '22 at 13:59
  • @ortunoa Did you read the previous comments about the quadratic execution time (`O(n^2)`)? Besides, I do not see how the append loop can still exist if you do one query: you can just concatenate the two dataframes. – Jérôme Richard Aug 22 '22 at 16:33
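The comments' advice (one big query via `pd.read_sql`, then work against the in-memory frame) can be sketched as follows. This uses an in-memory SQLite table as a stand-in for the real database, and the `records` table and `doubled` calculation are made-up examples.

```python
import sqlite3

import pandas as pd

# Stand-in for the real database: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(i, i * 0.5) for i in range(10000)],
)

# One big query instead of 10,000 small ones.
df = pd.read_sql("SELECT id, value FROM records", conn)

# Per-row calculations run vectorized against the in-memory
# DataFrame, not against the database.
df["doubled"] = df["value"] * 2
```

This avoids both bottlenecks the comments identify: the per-row round trips to the database and the O(n^2) append loop.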

0 Answers