
Thousands of DataFrames with consistent columns are being generated in a for loop that reads different files, and I'm trying to merge / concat / append them into a single DataFrame, combined:

combined = pd.DataFrame()

for i in range(1, 1000):  # demo only
    df = generate_df()  # one per-file DataFrame is created here
    combined = pd.concat([combined, df])

This is initially fast but slows down as combined grows, eventually becoming unusably slow. This answer on how to append rows explains that adding rows to a dict (or list of dicts) and creating the DataFrame once at the end is the most efficient approach, but I can't figure out how to do that with to_dict.

What's a good way to do this? Am I approaching this the wrong way?

3 Answers


The fastest way is to build a list of dictionaries and create the DataFrame only once, at the end:

rows = []

for i in range(1, 1000):
    # Instead of generating a dataframe, generate a dictionary
    dictionary = generate_dictionary()
    rows.append(dictionary)

combined = pd.DataFrame(rows)

This is about 100 times faster than concatenating DataFrames one at a time, as shown by the benchmark here.
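If the per-file data only comes as a DataFrame (as in the question), one way to feed this pattern is to convert each frame to row dictionaries with to_dict('records') and extend the list. A minimal sketch, where generate_df is just a stand-in for whatever builds each per-file frame:

import pandas as pd

def generate_df():
    # Stand-in for the per-file DataFrame from the question.
    return pd.DataFrame({"a": [1, 2], "b": [3, 4]})

rows = []
for i in range(1, 1000):  # demo only
    df = generate_df()
    # to_dict('records') returns one dict per row, e.g. {'a': 1, 'b': 3}
    rows.extend(df.to_dict("records"))

combined = pd.DataFrame(rows)

Note that the big speed-up comes from never building a per-file DataFrame in the first place; if generate_df has to return a DataFrame anyway, collecting the frames and concatenating once (as in the answer below) is usually the simpler fit.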

Francesco Pasa

You can create a list of DataFrames and then call concat only once at the end; this avoids re-copying the growing combined DataFrame on every iteration:

dfs = []

for i in range(1, 1000):  # demo only
    df = generate_df()  # df is created here
    dfs.append(df)

combined = pd.concat(dfs)
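To see the difference, here is a rough timing sketch; the toy generate_df and the sizes are made up for illustration, so the absolute numbers will vary, but the gap grows quickly with the number of frames:

import time
import pandas as pd

def generate_df():
    # Toy stand-in for reading one file.
    return pd.DataFrame({"a": range(100), "b": range(100)})

# Pattern from the question: the accumulated frame is copied on every iteration.
start = time.perf_counter()
combined = pd.DataFrame()
for i in range(1, 1000):
    combined = pd.concat([combined, generate_df()])
print("repeated concat:", time.perf_counter() - start)

# Pattern from this answer: each frame is copied only once, at the end.
start = time.perf_counter()
dfs = [generate_df() for i in range(1, 1000)]
combined = pd.concat(dfs)
print("concat once:", time.perf_counter() - start)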
jezrael
  • Use concat only once at the end.
  • Sort the index of each DataFrame. In my production code this sort didn't take long, yet it reduced the processing time of concat from 10+ seconds to less than one second!

dfs = []

for i in range(1, 1000):  # demo only
    df = generate_df()  # df is created here
    df.sort_index(inplace=True)
    dfs.append(df)

combined = pd.concat(dfs)
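Applied to the original use case of reading many files, the whole pattern might look like the sketch below; the data/*.csv glob and the read_csv call are assumptions standing in for however the files are actually produced:

import glob
import pandas as pd

dfs = []
for path in glob.glob("data/*.csv"):    # assumed file layout, not from the question
    df = pd.read_csv(path)              # assumed format; use whatever reader applies
    df.sort_index(inplace=True)         # cheap per file, can speed up the final concat
    dfs.append(df)

combined = pd.concat(dfs, ignore_index=True)  # drop ignore_index to keep the original row labels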
ChaimG