
Thousands of DataFrames with consistent columns are being generated in a for loop that reads different files, and I'm trying to merge / concat / append them into a single DataFrame, combined:

combined = pd.DataFrame()

for i in range(1, 1000):  # demo only
    df = generate_df()  # one per-file DataFrame is created here
    combined = pd.concat([combined, df])

This is initially fast but slows down as combined grows, eventually becoming unusably slow. This answer on how to append rows explains that adding rows to a dict (or list of dicts) and creating the DataFrame once at the end is the most efficient approach, but I can't figure out how to do that with to_dict.

What's a good way to do this? Am I approaching this the wrong way?

3 Answers


The fastest way is to build a list of dictionaries and create the DataFrame only once, at the end:

rows = []

for i in range(1, 1000):
    # Instead of generating a dataframe, generate a dictionary
    dictionary = generate_dictionary()
    rows.append(dictionary)

combined = pd.DataFrame(rows)

This is about 100 times faster than concatenating DataFrames one at a time, as shown by the benchmark here.
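If the per-file data only comes as a DataFrame (as in the question), one way to feed this pattern is to convert each frame to row dictionaries with to_dict('records') and extend the list. A minimal sketch, where generate_df is just a stand-in for whatever builds each per-file frame:

import pandas as pd

def generate_df():
    # Stand-in for the per-file DataFrame from the question.
    return pd.DataFrame({"a": [1, 2], "b": [3, 4]})

rows = []
for i in range(1, 1000):  # demo only
    df = generate_df()
    # to_dict('records') returns one dict per row, e.g. {'a': 1, 'b': 3}
    rows.extend(df.to_dict("records"))

combined = pd.DataFrame(rows)

Note that the big speed-up comes from never building a per-file DataFrame in the first place; if generate_df has to return a DataFrame anyway, collecting the frames and concatenating once (as in the answer below) is usually the simpler fit.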

Francesco Pasa

You can create a list of DataFrames and then call concat only once at the end; this avoids re-copying the growing combined DataFrame on every iteration:

dfs = []

for i in range(1, 1000):  # demo only
    df = generate_df()  # df is created here
    dfs.append(df)

combined = pd.concat(dfs)
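To see the difference, here is a rough timing sketch; the toy generate_df and the sizes are made up for illustration, so the absolute numbers will vary, but the gap grows quickly with the number of frames:

import time
import pandas as pd

def generate_df():
    # Toy stand-in for reading one file.
    return pd.DataFrame({"a": range(100), "b": range(100)})

# Pattern from the question: the accumulated frame is copied on every iteration.
start = time.perf_counter()
combined = pd.DataFrame()
for i in range(1, 1000):
    combined = pd.concat([combined, generate_df()])
print("repeated concat:", time.perf_counter() - start)

# Pattern from this answer: each frame is copied only once, at the end.
start = time.perf_counter()
dfs = [generate_df() for i in range(1, 1000)]
combined = pd.concat(dfs)
print("concat once:", time.perf_counter() - start)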
jezrael
  • Use concat only once at the end.
  • Sort the index of each DataFrame. In my production code this sort didn't take long, yet it reduced the processing time of concat from 10+ seconds to less than one second!

dfs = []

for i in range(1, 1000):  # demo only
    df = generate_df()  # df is created here
    df.sort_index(inplace=True)
    dfs.append(df)

combined = pd.concat(dfs)
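Applied to the original use case of reading many files, the whole pattern might look like the sketch below; the data/*.csv glob and the read_csv call are assumptions standing in for however the files are actually produced:

import glob
import pandas as pd

dfs = []
for path in glob.glob("data/*.csv"):    # assumed file layout, not from the question
    df = pd.read_csv(path)              # assumed format; use whatever reader applies
    df.sort_index(inplace=True)         # cheap per file, can speed up the final concat
    dfs.append(df)

combined = pd.concat(dfs, ignore_index=True)  # drop ignore_index to keep the original row labels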
ChaimG