
I could use some help speeding up this block of code. I'm assuming the concat ops are what's slowing it down, but I'm not sure. This one loop takes about 20x as long to execute as the rest of the code in my function.

df = pd.DataFrame()
for index, row in p_data_df.iterrows():
    test_df = log_df.loc[row['Mid-C']].to_frame().transpose()
    if 'S' not in test_df.columns:
        test_df.insert(0, 'S', row.loc['S'])
        test_df.insert(1, 'C #', row.loc['C #'])
        test_df.insert(2, 'Num', row.loc['Num'])

    df = pd.concat([df, test_df], axis=0)
John
  • Show us some sample data and your expected output? – BENY Oct 02 '19 at 16:21
  • Why are you trying to hand-concat a data frame instead of using a chain of `.map` and `.filter` on the source frame? – 9000 Oct 02 '19 at 16:28
  • I'm not familiar with map and filter. My pandas exposure is pretty limited. – John Oct 02 '19 at 16:30
  • Have you solved your problem? – Massifox Oct 17 '19 at 08:28
  • 1
    I'm currently using the code I used in the answer I posted below. I marked your answer as the solution though as it clearly explains where I went wrong and how to avoid it. – John Oct 17 '19 at 15:22

3 Answers


Never call pd.concat inside a for-loop. It leads to quadratic copying: concat returns a new DataFrame, so on every iteration space has to be allocated for the new DataFrame and the data from the old DataFrames has to be copied into it.

So with your dataframe having N rows, you need O(N^2) copies to complete the loop.
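
A quick way to see this is a toy benchmark (the data here is made up purely for the demo; absolute timings will vary by machine and pandas version):

import time
import pandas as pd

rows = [{'a': i, 'b': 2 * i} for i in range(2000)]  # dummy data for the demo

t0 = time.perf_counter()
df = pd.DataFrame()
for r in rows:
    # every pass allocates a brand-new frame and re-copies all earlier rows
    df = pd.concat([df, pd.DataFrame([r])], axis=0)
print('concat inside the loop:', time.perf_counter() - t0)

t0 = time.perf_counter()
df = pd.DataFrame(rows)  # one allocation, one copy
print('build once from a list:', time.perf_counter() - t0)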

Accumulate the results in a list of dictionaries or a list of lists instead of growing a dataframe, and create your dataframe from that list once, outside the for-loop. You will save a lot of execution time; pandas is not made for incremental row-by-row appends like this.

Here's how you could do it:

list_res = []  # accumulate one small DataFrame per row
for index, row in p_data_df.iterrows():
    test_df = log_df.loc[row['Mid-C']].to_frame().transpose()
    if 'S' not in test_df.columns:
        test_df.insert(0, 'S', row.loc['S'])
        test_df.insert(1, 'C #', row.loc['C #'])
        test_df.insert(2, 'Num', row.loc['Num'])
    list_res.append(test_df)

df = pd.concat(list_res, axis=0)  # single concatenation outside the loop

More tips to speed up your code

iterrows is the slowest possible way to iterate over a dataframe, because each row has to be converted into a Series first. itertuples avoids that conversion, so you can switch to it without changing your code much while still gaining performance; see the sketch below.
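
For example, a minimal rewrite of the loop above with itertuples might look like this. Note that column names such as 'Mid-C' and 'C #' are not valid Python identifiers, so itertuples' namedtuples would rename them positionally; selecting the columns first and asking for plain tuples (name=None) sidesteps that:

list_res = []
cols = ['Mid-C', 'S', 'C #', 'Num']
# index=False, name=None yields plain tuples, so the awkward column names are no problem
for mid_c, s, c_hash, num in p_data_df[cols].itertuples(index=False, name=None):
    test_df = log_df.loc[mid_c].to_frame().transpose()
    if 'S' not in test_df.columns:
        test_df.insert(0, 'S', s)
        test_df.insert(1, 'C #', c_hash)
        test_df.insert(2, 'Num', num)
    list_res.append(test_df)

df = pd.concat(list_res, axis=0)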

There are other methods (vectorization, the apply function, Cython, ...) that would require somewhat larger changes to your code but would make it more efficient still. I leave you this link for a little more information.
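
As a rough sketch of the vectorized route, assuming log_df's index holds the 'Mid-C' keys (which the log_df.loc[row['Mid-C']] lookup in your loop implies), the whole loop can collapse into one lookup plus a few column inserts. This is an illustration under that assumption, not a drop-in replacement:

# Look up every 'Mid-C' key at once instead of one .loc call per row
out = log_df.loc[p_data_df['Mid-C']].reset_index(drop=True)
for i, col in enumerate(['S', 'C #', 'Num']):
    if col not in out.columns:
        # .to_numpy() copies by position, ignoring index alignment
        out.insert(i, col, p_data_df[col].to_numpy())
df = out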

Massifox

The concat as used results in quadratic copying: each time you add a row, the operation returns a copy of the previous dataframe plus the new row. A better approach is to collect all the rows in a list and do a single concatenation at the end.

frames = []  # collect the per-row frames here
for index, row in p_data_df.iterrows():
    test_df = log_df.loc[row['Mid-C']].to_frame().transpose()
    if 'S' not in test_df.columns:
        test_df.insert(0, 'S', row.loc['S'])
        test_df.insert(1, 'C #', row.loc['C #'])
        test_df.insert(2, 'Num', row.loc['Num'])
    frames.append(test_df)
df = pd.concat(frames, axis=0)  # one final concatenation
Alexander

I was able to come up with this, which seems to be at least an order of magnitude faster:

df = pd.concat(
    [
        p_data_df.loc[:, ['S', 'C #', 'Num']],
        log_df[log_df['MD'].isin(p_data_df['Mid-C'].values)],
    ],
    sort=False,
    axis=1,
)  # axis=1 concatenation aligns the two frames on their shared index
John