Suppose we have a scenario with multiple customers and each customer can purchase different products of different quantities. This is illustrated in the dataframe below - with multiple ids and each id may have different column values as shown below:
id item No_Units No_Purchases
1 140 46 1
1 160 16 1
2 160 46 2
3 140 16 1
From this we want to achieve a single row, so that we can view each customer's purchases on a single row. Updated dataframe should based on the following conditions:
- Each item in the items' column should have its own column similar to converting it into dummy variables. Therefore, item_140, item_160 etc. The content in item item_46 would be the item number 46. If the customer doesn't the item, it should be assigned a zero.
- No_Units and No_Purchases should split in relation to the associated item column - eg: No_Units_item_140,No_Units_item_160, No_Purchases_item_140, No_Purchases_item_160.
NB: there are multiple columns in addition to the columns shown above. Output Dataframe:
id item_140 item_160 No_Units_item_140 No_Purchases_140 No_Units_160 No_Purchases_160
1 140 160 46 1 16 2
2 0 160 0 0 46 2
3 140 0 16 1 0 0
The first part of the code is to create the dummy variables:
df = pd.get_dummies(df, columns=['item'],drop_first=True, prefix=['item'])
I have attempted to code the solution, however there are issue with speed,labels, merging and inserting original values into the dummy dataframe:
output = pd.DataFrame()
for customer in train_data['id'].unique():
df_user = df[df['id']==customer]
t = pd.DataFrame()
for feat in ['No_Units','No_Purchases']:
t = pd.DataFrame([df_user['item'],df_user[feat]]).T
t = pd.concat([t,t1],axis=1)
t=t.stack().to_frame().T
## t.columns = t.iloc[0]
t.columns = ['{}_{}'.format(*c) for c in t.columns]
t['id'] = df_user['id']
output = pd.concat([output,t],ignore_index=True)
I do know that dict() will speed this up considerably aswell.