I have a pandas DataFrame of about 100M rows. Processing it in parallel works very well on a multi-core machine, with 100% utilization of each core. However, the result of executor.map() is a generator, so in order to actually collect the processed results I iterate over that generator. This is very, very slow (hours), partly because it is single-core and partly because of the loop. In fact, it is much slower than the actual processing in my_function().
Is there a better way (perhaps concurrent and/or vectorized)?
EDIT: Using pandas 0.23.4 (latest at this time) with Python 3.7.0
import concurrent.futures
import pandas as pd

df = pd.DataFrame({'col1': [], 'col2': [], 'col3': []})

with concurrent.futures.ProcessPoolExecutor() as executor:
    gen = executor.map(my_function, list_of_values, chunksize=1000)
    # the following is single-threaded and also very slow
    for x in gen:
        df = pd.concat([df, x])  # anything better than doing this?

return df  # (the snippet above lives inside a function)
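For context on why the loop is so slow: each pd.concat([df, x]) call copies the entire accumulated frame, so the loop is quadratic in the total number of rows. A minimal sketch of the linear alternative, collecting the pieces first and concatenating once (toy frames stand in for my_function's output, which I'm assuming is a DataFrame per chunk):

```python
import pandas as pd

# Stand-ins for the DataFrames yielded by executor.map(my_function, ...)
pieces = [pd.DataFrame({'col1': [i], 'col2': [i * 2], 'col3': [i * 3]})
          for i in range(5)]

# One concat over the whole list copies each row once, instead of
# re-copying the accumulated frame on every iteration.
df = pd.concat(pieces, ignore_index=True)
```

With the real code this would be df = pd.concat(executor.map(my_function, list_of_values, chunksize=1000), ignore_index=True), since pd.concat accepts any iterable of frames, including the generator itself.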