import pandas as pd

n_df = pd.DataFrame(columns=["Geoid", "Occupancy", "BDG_Weights", "CTS_Weights", "BI_Weights"])

for idx, row in dd.iterrows():
    for token in ['SFD_', 'MFD_', 'COM_', 'IND_']:
        new_row = pd.DataFrame(
            [[idx, token.split('_')[0], row[token + 'STR_WT'], row[token + 'CNT_WT'], row[token + 'BI_WT']]],
            columns=["Geoid", "Occupancy", "BDG_Weights", "CTS_Weights", "BI_Weights"])
        n_df = pd.concat([n_df, new_row])
    break  # left in from testing; remove to process all of dd

Here 'dd' is a dataframe with 26 million rows, and this loop is taking hours to run. Is there a faster way to build the concatenated dataframe?

  • Hi, Satya. Some strong cases have been made for not using iterrows: https://stackoverflow.com/a/55557758/246801, https://stackoverflow.com/q/24870953/246801. Please edit your post to include a **small** textual sample of the input and expected output. – Zach Young Oct 28 '22 at 21:32

1 Answer


Some minor improvements: use f-strings for your string labelling, e.g. token+'STR_WT' becomes f'{token}STR_WT'.

You can also replace token.split('_')[0] with token[:-1], since attribute lookups and method calls are generally slower than slicing.

column_names = ["Geoid", "Occupancy", "BDG_Weights", "CTS_Weights", "BI_Weights"]  # build once instead of inside the loop
n_df = pd.DataFrame(columns=column_names)

# Precompute the per-token labels so the f-strings are not rebuilt on every
# iteration of the inner loop
token_names = [[token[:-1], f'{token}STR_WT', f'{token}CNT_WT', f'{token}BI_WT']
               for token in ['SFD_', 'MFD_', 'COM_', 'IND_']]

for idx, row in dd.iterrows():
    for token in token_names:
        new_row = pd.DataFrame([[idx, token[0], row[token[1]], row[token[2]], row[token[3]]]],
                               columns=column_names)  # token[3] is the BI weight; the original used token[2] twice
        n_df = pd.concat([n_df, new_row])
    break  # left in from testing; remove to process all of dd

You could also bind pd.concat and pd.DataFrame to local variables before the loop to avoid the repeated attribute lookups.
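For instance, a minimal sketch of that micro-optimization, continuing from the snippet above (it assumes dd, token_names, and column_names are defined as before):

concat = pd.concat         # bound once, outside the loops
make_frame = pd.DataFrame  # avoids an attribute lookup on pd per iteration

n_df = make_frame(columns=column_names)
for idx, row in dd.iterrows():
    for token in token_names:
        new_row = make_frame([[idx, token[0], row[token[1]], row[token[2]], row[token[3]]]],
                             columns=column_names)
        n_df = concat([n_df, new_row])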

These are relatively small tweaks, though. The biggest win would come from rethinking why these steps need to happen in this order: is there anything you can filter out so that you don't have to touch every one of these entries?

I'm not entirely sure, but it seems that what you're trying to do is concatenate entire columns (those with token names) into a single dataframe. It might be worth checking out https://sparkbyexamples.com/pandas/pandas-create-new-dataframe-by-selecting-specific-columns/ to select all of the columns for a given token at once and concatenate those token values into one dataframe in a single operation.
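For example, here is a vectorized sketch along those lines. It assumes dd's index holds the Geoid values and that the *_STR_WT / *_CNT_WT / *_BI_WT columns exist exactly as in the question; instead of roughly 100 million tiny concats it performs four column selections and a single concat:

import pandas as pd

parts = []
for token in ['SFD', 'MFD', 'COM', 'IND']:
    # pull all three weight columns for this occupancy type at once
    part = dd[[f'{token}_STR_WT', f'{token}_CNT_WT', f'{token}_BI_WT']].copy()
    part.columns = ['BDG_Weights', 'CTS_Weights', 'BI_Weights']
    part.insert(0, 'Occupancy', token)  # scalar is broadcast to every row
    part.insert(0, 'Geoid', dd.index)   # assumes Geoid is dd's index, as in your loop
    parts.append(part)

n_df = pd.concat(parts, ignore_index=True)  # one concat of four large frames

This keeps the row-wise work inside pandas' vectorized internals rather than in a Python-level loop, which is where the hours are going.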

Andrew Ryan
  • Thanks for the suggestions, it's optimized a lot but still taking a lot of time as the data is huge. Can you suggest how to use pandas vectorization for this task? – Satya Dalei Oct 31 '22 at 05:35
  • @SatyaDalei can you include a sample of your data, as Zach Young suggested, so we can better assist you with optimizations? – Andrew Ryan Oct 31 '22 at 06:07