Python for loop taking forever to run with huge dataset

Question

The df is formatted in this manner:

Zip Code | State | Carrier | Price
__________________________________
xxxxx    |  XX   |  ABCD   |  12.0
xxxxx    |  XX   |  TUSD   |  15.0
xxxxx    |  XX   |  PPLD   |  17.0

The Code:

carrier_sum = []
unique_carrier = a_df['Carrier'].unique()
for i in unique_carrier:
    x=0
    for y, row in a_df.iterrows():
        x = a_df.loc[a_df['Carrier'] == i, 'Price'].sum()
    print(i, x)
    carrier_sum.append([i,x])

This is my code. First it builds the unique_carrier list. Then, for each carrier, it uses iterrows() to go through the df, sums the 'Price' values, and appends the result to the empty carrier_sum list I created.

The problem is that it seems to take forever. I ran it once and it took over 15 minutes just to get the sum for the first unique carrier, and there are 8 of them.

What can I do to make it more efficient?

The dataset is over 300000 rows long.

One idea I had is to define the list of unique carriers beforehand, since I don't really need to look them up in the df. Another is to sort the main dataset alphabetically by carrier name and make the unique carrier list follow the same order.

Thank you for reading.

You can do this without looping using `groupby` and `sum`. Have you used those? — Tim Roberts, Dec 22 '21 at 04:49

score 0 · Answer 1 · edited Dec 22 '21 at 05:37

This should work for you:

df.groupby('Carrier')['Price'].sum()
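
For context, the loop in the question is slow because the inner iterrows() pass recomputes the same filtered sum once for every row, so each carrier triggers roughly 300000 identical full-column scans. A minimal sketch of the groupby approach follows; the sample rows and the name carrier_sum_list are made up for illustration, and a_df stands in for the question's DataFrame:

```python
import pandas as pd

# Hypothetical sample data laid out like the table in the question.
a_df = pd.DataFrame({
    "Zip Code": ["11111", "22222", "33333", "44444"],
    "State": ["AA", "BB", "AA", "CC"],
    "Carrier": ["ABCD", "TUSD", "ABCD", "PPLD"],
    "Price": [12.0, 15.0, 3.0, 17.0],
})

# One pass over the data: group rows by carrier and sum each group's prices.
carrier_sum = a_df.groupby("Carrier")["Price"].sum()
print(carrier_sum)

# If a list of [carrier, total] pairs is needed, as in the question's code,
# it can be built from the resulting Series.
carrier_sum_list = [[carrier, total] for carrier, total in carrier_sum.items()]
```

The result is a Series indexed by carrier, and groupby sorts the carriers by default, so no manual sorting or predefined carrier list should be needed.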

edited Dec 22 '21 at 05:37

Tamil Selvan

answered Dec 22 '21 at 04:58

Sakshi Maurya
