I have a pandas data frame of about 100M rows. Processing it in parallel works very well on a multi-core machine, with 100% utilization of each core. However, the result of executor.map() is a generator, so to actually collect the processed results I have to iterate through it. This is very, very slow (hours), partly because it runs on a single core and partly because of the loop itself. In fact, it is much slower than the actual processing in my_function().

Is there a better way (perhaps concurrent and/or vectorized)?

EDIT: Using pandas 0.23.4 (latest at this time) with Python 3.7.0

import concurrent.futures  # a bare "import concurrent" does not load the futures submodule
import pandas as pd

df = pd.DataFrame({'col1': [], 'col2': [], 'col3': []})

with concurrent.futures.ProcessPoolExecutor() as executor:
    gen = executor.map(my_function, list_of_values, chunksize=1000)

# the following is single-threaded and also very slow
for x in gen:
    df = pd.concat([df, x])  # anything better than doing this?
return df
wishihadabettername

1 Answer

Here is a benchmark related to your case: https://stackoverflow.com/a/31713471/5588279

As you can see, calling concat (or append) repeatedly is very inefficient. You should just call pd.concat(gen) once. I believe the underlying implementation will preallocate all the memory it needs.

In your case, the memory allocation is redone on every iteration of the loop.
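
As a minimal sketch of the single-call approach: the `make_chunks` generator below is a hypothetical stand-in for what `executor.map()` returns, and the column values are made up for illustration. `ignore_index=True` is an optional choice here, not something the question requires.

```python
import pandas as pd


def make_chunks(n_chunks=4, rows_per_chunk=3):
    """Hypothetical stand-in for the generator returned by executor.map()."""
    for i in range(n_chunks):
        start = i * rows_per_chunk
        yield pd.DataFrame({
            'col1': list(range(start, start + rows_per_chunk)),
            'col2': [i] * rows_per_chunk,
        })


# Slow pattern (from the question): calling pd.concat([df, x]) inside a loop
# copies every row accumulated so far on each iteration.
#
# Single-call pattern: hand the whole generator to pd.concat, which consumes
# it and joins all frames in one call.
combined = pd.concat(make_chunks(), ignore_index=True)
print(combined.shape)
```

With the loop, the amount of copying grows with every chunk that has already been appended; with one call, each chunk is copied into the result only once.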

Sraw
  • I tried `pd.concat([df, gen])` but got `TypeError: cannot concatenate object of type "<class 'generator'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid`. That's fundamentally the crux of the question, whether something like this exists. – wishihadabettername Oct 30 '18 at 17:11
  • @wishihadabettername Try to convert it into a list first. – Sraw Oct 30 '18 at 17:15
  • It worked; `list()` was not needed after all but I had to do it in two stages, first convert the generator to its own data frame, `df2 = pd.concat(gen)` then concatenate it with the existing data frame, `df_final = pd.concat([df, df2])`. This second step is outside of the fundamental question but was present in the question so I'm making it explicit here. Thanks @Sraw and @ALoltz. – wishihadabettername Oct 30 '18 at 19:00
  • Yeah, I thought it should work without converting. And I believe it is much faster now? – Sraw Oct 30 '18 at 19:30
  • Yes, much faster. – wishihadabettername Oct 30 '18 at 20:03
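
The two-stage workaround from the comments can probably be folded into one call by chaining the existing frame in front of the generator. This is only a sketch: `collect` is a hypothetical helper name, `my_function` here is a made-up worker that returns a one-row frame, and a plain generator expression stands in for the `executor.map()` result.

```python
import itertools

import pandas as pd


def my_function(value):
    """Hypothetical worker: returns a one-row frame for each input value."""
    return pd.DataFrame({'col1': [value], 'col2': [value * 2], 'col3': [str(value)]})


def collect(existing, results):
    """Join an existing frame and an iterable of frames in a single pd.concat call."""
    # itertools.chain yields the existing frame first, then every frame from
    # `results`, so pd.concat receives one flat iterable of DataFrames.
    return pd.concat(itertools.chain([existing], results), ignore_index=True)


df = pd.DataFrame({'col1': [0], 'col2': [0], 'col3': ['0']})
gen = (my_function(v) for v in [1, 2, 3])  # stand-in for executor.map(...)
df_final = collect(df, gen)
print(df_final)
```

With the executor from the question, the call would look like `df_final = collect(df, executor.map(my_function, list_of_values, chunksize=1000))`. The original error came from passing `[df, gen]`, where the generator itself was treated as one of the objects to concatenate rather than as a source of frames.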