I have a pandas DataFrame of about 100M rows. Processing it in parallel works very well on a multi-core machine, with 100% utilization of each core. However, the result of executor.map() is a generator, so in order to actually collect the processed results I iterate over that generator. This is very, very slow (hours), partly because it is single-core and partly because of the loop. In fact, it is much slower than the actual processing in my_function().
Is there a better way (perhaps concurrent and/or vectorized)?
EDIT: Using pandas 0.23.4 (latest at this time) with Python 3.7.0
import concurrent.futures
import pandas as pd

df = pd.DataFrame({'col1': [], 'col2': [], 'col3': []})

with concurrent.futures.ProcessPoolExecutor() as executor:
    gen = executor.map(my_function, list_of_values, chunksize=1000)
    # the following is single-threaded and also very slow
    for x in gen:
        df = pd.concat([df, x])  # anything better than doing this?

return df  # (the snippet above lives inside a function)
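For context on why the loop is so slow: each pd.concat([df, x]) call copies the entire accumulated frame, so the loop is quadratic in the total number of rows. A minimal sketch of the linear alternative, collecting the pieces first and concatenating once (toy frames stand in for my_function's output, which I'm assuming is a DataFrame per chunk):

```python
import pandas as pd

# Stand-ins for the DataFrames yielded by executor.map(my_function, ...)
pieces = [pd.DataFrame({'col1': [i], 'col2': [i * 2], 'col3': [i * 3]})
          for i in range(5)]

# One concat over the whole list copies each row once, instead of
# re-copying the accumulated frame on every iteration.
df = pd.concat(pieces, ignore_index=True)
```

With the real code this would be df = pd.concat(executor.map(my_function, list_of_values, chunksize=1000), ignore_index=True), since pd.concat accepts any iterable of frames, including the generator itself.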