0

There is a data frame with name df which contains repeating rows identified with DICE_SUMMARY_ID.
After I perform some calculations for different columns, I need to write back the results to the original dataframe.
The issue is that df contains over 100k rows and a for loop is very time consuming. Currently, it shows about 3 hours.
How can I reduce the time?

#extract unique ids from dataframe
uniqueIDs = df['DICE_SUMMARY_ID'].unique()

#iterate over the unique ids and calculate 
for i in range(len(uniqueIDs)):
    
    #get a slice of the dataframe at i'th unique id
    uniqueID_df = df.loc[df['DICE_SUMMARY_ID'] == uniqueIDs[i]]

    #calculate sum of all family types
    SINGLE_ADULTS = int((uniqueID_df['FAMILY_TYPE_ID'] == 10001).sum())
    EXTRA_ADULTS = int((uniqueID_df['FAMILY_TYPE_ID'] == 10003).sum())
    NO_OF_ADULTS = int(SINGLE_ADULTS + EXTRA_ADULTS)
    NO_OF_DEPENDENTS_U_16 = int((uniqueID_df['FAMILY_TYPE_ID'] == 20001).sum())
    NO_OF_DEPENDENTS_16_TO_18 = int((uniqueID_df['FAMILY_TYPE_ID'] == 20002).sum())


    #get the array of indexes of each unique uid in the tuple
    #each unique uid has 10 - 20 rows in the original     df, 
    #given that there are over 100k records, this becoming very time consuming
    indices = np.where(df["DICE_SUMMARY_ID"] == uniqueIDs[i])[0]
    

    for j in indices:
#         #insert value in column at index for each repeating index
        df['NO_OF_ADULTS'].iloc[j] = NO_OF_ADULTS
        df['NO_OF_DEPENDENTS_U_16'].iloc[j] = NO_OF_DEPENDENTS_U_16
        df['NO_OF_DEPENDENTS_16_TO_18'].iloc[j] = NO_OF_DEPENDENTS_16_TO_18

faster version, but I am still not satisfied

df['NO_OF_ADULTS'].iloc[indices.min():indices.max()] = NO_OF_ADULTS
df['NO_OF_DEPENDENTS_U_16'].iloc[indices.min():indices.max()] = NO_OF_DEPENDENTS_U_16
df['NO_OF_DEPENDENTS_16_TO_18'].iloc[indices.min():indices.max()] = NO_OF_DEPENDENTS_16_TO_18
bibscy
  • 2,598
  • 4
  • 34
  • 82
  • `for i in range(len(uniqueIDs)):` is usually solved by `df.groupby('DICE_SUMMARY_ID')`. You should read more about [it](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html). – Quang Hoang Jun 03 '22 at 16:15
  • @QuangHoang I don't know what you have in mind. If you have the time please post an answer. I have just edited my original question and added a second version for a faster execution, but I'd still be interested to see how the ```groupby``` would work in this instance. – bibscy Jun 03 '22 at 16:40
  • TBH, your question is too much, it takes a lot of time to understand what it does. Seeing you're an experienced user, I would suggest you follow [this guide](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) first. – Quang Hoang Jun 03 '22 at 16:42

0 Answers0