
I have an issue with a part of my code that seems to run slowly.

I suppose it's because I'm growing a DataFrame inside a loop. Here is the code:

# creating a dataframe for ALL data
df_all = pd.DataFrame() 

for idx, x in enumerate(all_data[0]):
    
    peak_indx_E = ...
    ...
   
    # TODO: speed up!
    # it works slow because of this? How to avoid this problem if I need to output a dataframe
    
    temp = pd.DataFrame(
      {
        'idx_global_num': idx, 
        ...
        'peak_sq_divE': peak_sq_divE
      }, index=[idx]
    )
    df_all = pd.concat([df_all, temp])

Can you give me a suggestion on how to speed up the execution? I suppose the pd.concat operation is what's slow.

How can I solve this issue?

twistfire

1 Answer

It looks like you're building two pandas DataFrame objects on each iteration (the temporary `temp`, plus a brand-new `df_all` from `pd.concat`), so the work grows quadratically with the number of rows. Instead, build a list of dicts during the iteration, and create the DataFrame once when you're done iterating.

Example:

df_list = []

for idx, x in enumerate(all_data[0]):
    df_list.append(
        {
            'idx_global_num': idx,
            ...
            'peak_sq_divE': peak_sq_divE
        }
    )

# one construction at the end instead of a concat per iteration
df_all = pd.DataFrame(df_list)
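For illustration, here is a self-contained sketch of the same pattern with synthetic data; the `value` column and the `range(5)` input are placeholders standing in for the real loop body, not fields from the original code:

```python
import pandas as pd

# accumulate plain dicts in a list -- cheap per-iteration work
rows = []
for idx, x in enumerate(range(5)):  # stand-in for all_data[0]
    rows.append({'idx_global_num': idx, 'value': x * 2})

# build the DataFrame once, after the loop
df_all = pd.DataFrame(rows)
print(df_all.shape)  # (5, 2)
```

Each `pd.concat` inside a loop copies every row accumulated so far, which is why deferring DataFrame construction to a single call after the loop scales so much better.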

warvolin
  • thanks warvolin, it is so simple, but it works. – twistfire Feb 08 '23 at 22:11
  • speed up 2 times at least – twistfire Feb 08 '23 at 22:12
  • You can also profile your code to see where it's still slow if you need to make it faster. Check out this link for more info https://stackoverflow.com/questions/582336/how-do-i-profile-a-python-script Depending on what type of object or data `all_data[0]` is there might be more performance improvements. – warvolin Feb 10 '23 at 01:24