
I have a dataset of around 15 million rows on which pandas is unable to execute a for-loop in reasonable time. I am trying a dask dataframe to speed up the execution, however, iteration isn't working.

Initial dataframe example:

import numpy as np
import pandas as pd

cols = ['id', 'cur_age', 'EGI0', 'EXP0', 'PEGI', 'PExp', 'gEGI', 'TotExp']
data = [[12003, 1, 446499.51, 214319.76, np.nan, np.nan, 0.00228, 0.00228],
        [12003, 2, 446499.51, 214319.76, np.nan, np.nan, 0.00228, 0.00228],
        [12003, 3, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
        [12003, 4, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
        [12003, 5, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
        [12003, 6, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
        [12003, 7, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
        [12003, 8, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
        [12003, 9, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184],
        [12003, 10, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184],
        [12014, 1, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
        [12014, 2, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
        [12014, 3, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
        [12014, 4, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
        [12014, 5, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183]
]
bookdf = pd.DataFrame(data, columns = cols)

Desired Output:

# (the 'expx' values were omitted in the original example)
cols = ['id', 'cur_age', 'EGI0', 'EXP0', 'PEGI', 'PExp', 'gEGI', 'TotExp', 'check', 'egix']
data = [[12003, 1, 446499.51, 214319.76, np.nan, np.nan, 0.00228, 0.00228, 0, 446499.51923],
        [12003, 2, 446499.51, 214319.76, np.nan, np.nan, 0.00228, 0.00228, 1, 447517.89163],
        [12003, 3, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 2, 448338.21855],
        [12003, 4, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 3, 449160.04918],
        [12003, 5, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 4, 449983.38628],
        [12003, 6, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 5, 450808.23260],
        [12003, 7, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 6, 451634.59091],
        [12003, 8, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 7, 452462.46399],
        [12003, 9, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184, 8, 453294.43921],
        [12003, 10, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184, 9, 454127.94424],
        [12014, 1, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 0, 163392.40385],
        [12014, 2, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 1, 163765.06788],
        [12014, 3, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 2, 164065.25900],
        [12014, 4, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 3, 164366.00038],
        [12014, 5, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 4, 164667.29304]
]
bookdf = pd.DataFrame(data, columns = cols)

Working pandas code, which works on a small dataset only:

# The 'check' column is created to mark the first row of each group w.r.t. the 'id' column.
# I need to take the first row of each group and apply the calculation below to the rest of the rows in each group, but ```bookdf.groupby('id').first()``` does not work with the calculation below, which basically retains the last value and does the math.

bookdf['check'] =  bookdf.groupby(bookdf['id']).cumcount()
bookdf['egix']  = np.where((bookdf.check==0) & (bookdf.PEGI>0), bookdf.PEGI, bookdf.EGI0)
bookdf['expx']  = np.where((bookdf.check==0) & (bookdf.PExp>0), bookdf.PExp, bookdf.EXP0)
for ind in bookdf.index:
    if bookdf['check'][ind]!=0:
        bookdf['egix'][ind] = bookdf['egix'][ind-1]*(1 + bookdf['gEGI'][ind])
        bookdf['expx'][ind] = bookdf['expx'][ind-1]*(1 + bookdf['TotExp'][ind])
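Note that the recurrence in the loop above is a per-group cumulative product: each row multiplies the group's starting value by a running product of (1 + rate). That makes it possible to vectorize with groupby, transform, and cumprod instead of a Python loop. A minimal sketch on a trimmed copy of the sample data, so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

# trimmed copy of the sample frame so the sketch is self-contained
cols = ['id', 'cur_age', 'EGI0', 'EXP0', 'PEGI', 'PExp', 'gEGI', 'TotExp']
data = [[12003, 1, 446499.51, 214319.76, np.nan, np.nan, 0.00228, 0.00228],
        [12003, 2, 446499.51, 214319.76, np.nan, np.nan, 0.00228, 0.00228],
        [12014, 1, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
        [12014, 2, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183]]
bookdf = pd.DataFrame(data, columns=cols)

bookdf['check'] = bookdf.groupby('id').cumcount()
bookdf['egix'] = np.where((bookdf.check == 0) & (bookdf.PEGI > 0), bookdf.PEGI, bookdf.EGI0)
bookdf['expx'] = np.where((bookdf.check == 0) & (bookdf.PExp > 0), bookdf.PExp, bookdf.EXP0)

# growth factor is 1 on the first row of each group, (1 + rate) afterwards;
# the running value is then first_value * cumulative product of the factors
egi_factor = pd.Series(np.where(bookdf.check == 0, 1.0, 1.0 + bookdf.gEGI), index=bookdf.index)
exp_factor = pd.Series(np.where(bookdf.check == 0, 1.0, 1.0 + bookdf.TotExp), index=bookdf.index)
bookdf['egix'] = bookdf.groupby('id')['egix'].transform('first') * egi_factor.groupby(bookdf['id']).cumprod()
bookdf['expx'] = bookdf.groupby('id')['expx'].transform('first') * exp_factor.groupby(bookdf['id']).cumprod()
```

This is the same recurrence written in closed form, with no row iteration; dask's groupby also supports cumprod, though the rows would need to be sorted by cur_age within each id first.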

If I try to run the same code with dask dataframe, It's giving below mentioned error:

for ind in range(0, len(bookdf)):
    if bookdf['check'][ind]!=0:
        bookdf['egix'][ind] = bookdf['egix'][ind-1]*(1 + bookdf['gEGI'][ind])
        bookdf['expx'][ind] = bookdf['expx'][ind-1]*(1 + bookdf['TotExp'][ind])

**Error** : Series getitem is only supported for other series objects with matching partition structure.

Is there any way to implement this with a dask dataframe, or is there a better way to get the desired output with a large dataset?
  • It will be easier if you present a fully reproducible dataframe so that users can run the code and try to see the error themselves. At first glance, without knowing too much else, I would recommend you try and use the pandas shift() method alongside apply(). See https://stackoverflow.com/questions/10982089/how-to-shift-a-column-in-pandas-dataframe – HMReliable May 17 '21 at 19:56
  • @HMReliable, Hi, I have attached the reproducible dataframe as asked. Please help me out with this. I have tried shift() and apply() but they are not working, as I have to iterate over the whole column in order to get the desired calculation. –  May 17 '21 at 21:50

3 Answers

1

One option is to get rid of the loop altogether.

# this creates the mask of interest
mask = bookdf['check'] != 0

# now we can apply the mask with .loc
bookdf.loc[mask, 'egix'] = bookdf.loc[mask, ['egix']].shift(-1) * (bookdf.loc[mask, ['gEGI']])

bookdf.loc[mask, 'expx'] = bookdf.loc[mask, ['expx']].shift(-1) * (bookdf.loc[mask, ['TotExp']])

This should work for both pandas and dask dataframes.

SultanOrazbayev
  • Hi @SultanOrazbayev, thanks for the simplified code. However, this code is not returning the expected output, which is basically the multiplication of the previous row by (1 + the current row's rate). Is there any way I can get that, as shown in the reproducible desired output above? –  May 17 '21 at 22:10
  • It's not clear how the 'check' column is generated. – SultanOrazbayev May 17 '21 at 22:29
  • The 'check' column was generated to distinguish the first row of each group w.r.t. the 'id' column, in order to do the calculation for the rest of the rows in each group –  May 18 '21 at 00:41
  • I tried running your pandas code, but it does not work since columns egix/expx are not defined. – SultanOrazbayev May 18 '21 at 03:50
  • I have updated the code above with details on how egix/expx are generated. Please have a look –  May 18 '21 at 11:08
0

Since you need to access the previous values of 'egix' and 'expx' within each group, create two new columns to store them so that the calculations are done efficiently. Then use the apply method across the columns of the df:

bookdf['egix_prev'] = bookdf.groupby('id')['egix'].shift(1)
bookdf['expx_prev'] = bookdf.groupby('id')['expx'].shift(1)

bookdf['egix'] = bookdf.apply(lambda x: x['egix_prev'] * (1 + x['gEGI']) if x['check'] != 0 else x['egix'], axis=1)
bookdf['expx'] = bookdf.apply(lambda x: x['expx_prev'] * (1 + x['TotExp']) if x['check'] != 0 else x['expx'], axis=1)
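Because each row depends on the freshly updated previous row, a single shift(1) only propagates one step. If the recursion has to stay as explicit Python, one readable way to keep it correct per group is a plain function applied with groupby; a sketch on a hypothetical small frame where 'egix'/'expx' already hold the first-row starting values:

```python
import pandas as pd

def recurse(group):
    # run the recurrence inside one 'id' group using plain arrays,
    # so each row sees the already-updated previous row
    egix = group['egix'].to_numpy().copy()
    expx = group['expx'].to_numpy().copy()
    g = group['gEGI'].to_numpy()
    t = group['TotExp'].to_numpy()
    for i in range(1, len(group)):
        egix[i] = egix[i - 1] * (1 + g[i])
        expx[i] = expx[i - 1] * (1 + t[i])
    group = group.copy()
    group['egix'] = egix
    group['expx'] = expx
    return group

# hypothetical frame with the first row of each group pre-seeded
bookdf = pd.DataFrame({
    'id':     [1, 1, 2, 2],
    'gEGI':   [0.1, 0.1, 0.05, 0.05],
    'TotExp': [0.2, 0.2, 0.05, 0.05],
    'egix':   [100.0, 100.0, 200.0, 200.0],
    'expx':   [50.0, 50.0, 80.0, 80.0],
})
bookdf = bookdf.groupby('id', group_keys=False).apply(recurse)
```

The loop still runs in Python, but over NumPy arrays rather than per-row pandas indexing, which avoids the chained-assignment pattern and keeps group boundaries correct.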
HMReliable
  • Thanks for the code @HMReliable. The shift doesn't work in this case because 'egix' keeps updating for each row after each calculation. It only uses the first-row value of 'egix' for the next row's calculation, but each subsequent row should use the updated 'egix' value. That's why I used the for loop, to read the previous index and update the current index across the entire column –  May 18 '21 at 13:12
  • Please see my latest answer @Ankit Chaudhary – HMReliable May 18 '21 at 19:26
0

If you want to efficiently iterate over the rows of a large table, it is almost always better to use df.iterrows(). The code below should be about 20x faster than using an external for loop.

row_list = []
for row in bookdf.iterrows():  # row is an (index, Series) pair
    if row[1]['check'] != 0:
        # use the previously appended (already updated) row as the prior value
        row[1]['egix'] = row_list[-1][1]['egix'] * (1 + row[1]['gEGI'])
        row[1]['expx'] = row_list[-1][1]['expx'] * (1 + row[1]['TotExp'])
    row_list.append(row)
rebuilt_df = pd.DataFrame([row[1] for row in row_list])
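A related micro-optimization, sketched below on a hypothetical pre-seeded frame: itertuples() is typically much faster than iterrows() because it yields plain namedtuples instead of building a Series per row, and carrying the running state in plain arrays avoids per-row pandas indexing entirely:

```python
import pandas as pd

# hypothetical pre-seeded frame: 'egix'/'expx' already hold the first-row
# starting values, 'check' is the within-group row counter
bookdf = pd.DataFrame({
    'id':     [12003, 12003, 12014, 12014],
    'check':  [0, 1, 0, 1],
    'gEGI':   [0.00228, 0.00228, 0.00183, 0.00183],
    'TotExp': [0.00228, 0.00228, 0.00183, 0.00183],
    'egix':   [446499.51, 446499.51, 163392.40, 163392.40],
    'expx':   [214319.76, 214319.76, 78428.35, 78428.35],
})

egix = bookdf['egix'].to_numpy().copy()
expx = bookdf['expx'].to_numpy().copy()
# each namedtuple exposes the columns as attributes; rows with check == 0
# keep their seeded value, which also resets the state at group boundaries
for i, row in enumerate(bookdf.itertuples(index=False)):
    if row.check != 0:
        egix[i] = egix[i - 1] * (1 + row.gEGI)
        expx[i] = expx[i - 1] * (1 + row.TotExp)
bookdf['egix'] = egix
bookdf['expx'] = expx
```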
HMReliable