I have a dataset of around 15 million rows, and pandas is effectively unable to run a for-loop over it. I am trying to use a dask dataframe to speed up execution, but iteration isn't working.
Initial dataframe example:
import numpy as np
import pandas as pd

cols = ['id', 'cur_age', 'EGI0', 'EXP0', 'PEGI', 'PExp', 'gEGI', 'TotExp']
data = [[12003, 1, 446499.51, 214319.76, np.nan, np.nan, 0.00228, 0.00228],
[12003, 2, 446499.51, 214319.76, np.nan, np.nan, 0.00228, 0.00228],
[12003, 3, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
[12003, 4, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
[12003, 5, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
[12003, 6, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
[12003, 7, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
[12003, 8, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183],
[12003, 9, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184],
[12003, 10, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184],
[12014, 1, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
[12014, 2, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
[12014, 3, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
[12014, 4, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183],
[12014, 5, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183]
]
bookdf = pd.DataFrame(data, columns = cols)
Desired Output:
cols = ['id', 'cur_age', 'EGI0', 'EXP0', 'PEGI', 'PExp', 'gEGI', 'TotExp', 'check', 'egix']
# ('expx' is computed the same way as 'egix'; it is omitted from this sample output for brevity.)
data = [[12003, 1, 446499.51923, 23.76, np.nan, np.nan, 0.00228, 0.00228, 0, 446499.51923],
[12003, 2, 446499.51923, 32.76, np.nan, np.nan, 0.00228, 0.00228, 1, 447517.89163],
[12003, 3, 446499.51923, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 2, 448338.21855],
[12003, 4, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 3, 449160.04918],
[12003, 5, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 4, 449983.38628],
[12003, 6, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 5, 450808.23260],
[12003, 7, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 6, 451634.59091],
[12003, 8, 446499.51, 214319.76, np.nan, np.nan, 0.00183, 0.00183, 7, 452462.46399],
[12003, 9, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184, 8, 453294.43921],
[12003, 10, 446499.51, 214319.76, np.nan, np.nan, 0.00184, 0.00184, 9, 454127.94424],
[12014, 1, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 0, 163392.40385],
[12014, 2, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 1, 163765.06788],
[12014, 3, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 2, 164065.25900],
[12014, 4, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 3, 164366.00038],
[12014, 5, 163392.40, 78428.35, np.nan, np.nan, 0.00183, 0.00183, 4, 164667.29304]
]
bookdf = pd.DataFrame(data, columns = cols)
Working pandas code, which only works on a small dataset:
# The 'check' column flags the first row of each group (grouped by 'id') via a cumulative count.
# I need to take the first row of each group as the starting value and apply the calculation
# below to the remaining rows of that group, but `bookdf.groupby('id').first()` does not work
# with this calculation, since it only keeps a single row per group.
bookdf['check'] = bookdf.groupby('id').cumcount()
bookdf['egix'] = np.where((bookdf.check == 0) & (bookdf.PEGI > 0), bookdf.PEGI, bookdf.EGI0)
bookdf['expx'] = np.where((bookdf.check == 0) & (bookdf.PExp > 0), bookdf.PExp, bookdf.EXP0)
for ind in bookdf.index:
    if bookdf['check'][ind] != 0:
        bookdf['egix'][ind] = bookdf['egix'][ind - 1] * (1 + bookdf['gEGI'][ind])
        bookdf['expx'][ind] = bookdf['expx'][ind - 1] * (1 + bookdf['TotExp'][ind])
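The loop is a per-group recurrence: within each id, `egix[i] = egix[i-1] * (1 + gEGI[i])` starting from a seed in the first row (and likewise `expx` with `TotExp`). Since that is just a running product of growth factors, I think it can be written without an explicit loop; here is a minimal sketch of what I mean (my own rewrite, only checked against the small sample, and assuming rows within each id are already ordered by cur_age):

```python
import numpy as np
import pandas as pd

def add_egix_expx(df):
    # Sketch of a vectorized equivalent of the loop (not tested at scale):
    # within each id, egix[i] = seed * prod(1 + gEGI[1..i]), so a grouped
    # cumulative product over per-row factors should reproduce the recurrence.
    df = df.sort_values(['id', 'cur_age']).copy()   # assumes this is the intended row order
    check = df.groupby('id').cumcount()

    # First row of each group carries the seed; later rows carry a growth factor.
    egi_seed = np.where(df.PEGI > 0, df.PEGI, df.EGI0)
    exp_seed = np.where(df.PExp > 0, df.PExp, df.EXP0)
    egi_factor = np.where(check == 0, egi_seed, 1 + df.gEGI)
    exp_factor = np.where(check == 0, exp_seed, 1 + df.TotExp)

    # Grouped cumulative product: seed, seed*(1+g1), seed*(1+g1)*(1+g2), ...
    df['check'] = check
    df['egix'] = pd.Series(egi_factor, index=df.index).groupby(df['id']).cumprod()
    df['expx'] = pd.Series(exp_factor, index=df.index).groupby(df['id']).cumprod()
    return df

bookdf = add_egix_expx(bookdf)
```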
If I try to run the same loop on a dask dataframe, it gives the error below:
for ind in range(0, len(bookdf)):
    if bookdf['check'][ind] != 0:
        bookdf['egix'][ind] = bookdf['egix'][ind - 1] * (1 + bookdf['gEGI'][ind])
        bookdf['expx'][ind] = bookdf['expx'][ind - 1] * (1 + bookdf['TotExp'][ind])
**Error**: `Series getitem in only supported for other series objects with matching partition structure`
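From what I've read, one possible route is to keep the per-group logic in pandas and let dask apply it partition by partition with `map_partitions`, after making sure all rows of an id land in the same partition (e.g. via `set_index('id')`). A rough sketch of what I have in mind, reusing the `add_egix_expx` function from the sketch above (the partition count is just a placeholder, and I have not tested this at 15 million rows):

```python
import dask.dataframe as dd

# set_index('id') shuffles the data so all rows of a given id end up in the
# same partition; the pandas logic then runs independently on each partition.
ddf = dd.from_pandas(bookdf, npartitions=8)   # or dd.read_csv / dd.read_parquet
ddf = ddf.set_index('id')

# Each partition is a plain pandas DataFrame; reset_index() restores the 'id'
# column that add_egix_expx expects. dask infers the output schema by running
# the function on a dummy frame (it may warn that an explicit meta= is preferred).
result = ddf.map_partitions(lambda part: add_egix_expx(part.reset_index()))

out = result.compute()
```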
Is there any way to implement this with a dask dataframe, or a better way to get the desired output on a large dataset?