1

Suppose we have a scenario with multiple customers and each customer can purchase different products of different quantities. This is illustrated in the dataframe below - with multiple ids and each id may have different column values as shown below:

   id   item    No_Units     No_Purchases
    1      140         46            1
    1      160         16            1
    2      160         46            2
    3      140         16            1

From this we want to achieve a single row, so that we can view each customer's purchases on a single row. Updated dataframe should based on the following conditions:

  1. Each item in the items' column should have its own column similar to converting it into dummy variables. Therefore, item_140, item_160 etc. The content in item item_46 would be the item number 46. If the customer doesn't the item, it should be assigned a zero.
  2. No_Units and No_Purchases should split in relation to the associated item column - eg: No_Units_item_140,No_Units_item_160, No_Purchases_item_140, No_Purchases_item_160.

NB: there are multiple columns in addition to the columns shown above. Output Dataframe:

   id item_140   item_160   No_Units_item_140     No_Purchases_140  No_Units_160    No_Purchases_160
    1     140        160            46                    1             16                 2
    2     0          160            0                     0             46                 2           
    3     140        0              16                    1             0                  0

The first part of the code is to create the dummy variables:

df = pd.get_dummies(df, columns=['item'],drop_first=True, prefix=['item'])

I have attempted to code the solution, however there are issue with speed,labels, merging and inserting original values into the dummy dataframe:

 output = pd.DataFrame()
 for customer in train_data['id'].unique():
     df_user = df[df['id']==customer]
     t = pd.DataFrame()
     for feat in ['No_Units','No_Purchases']:
         t = pd.DataFrame([df_user['item'],df_user[feat]]).T
         t  = pd.concat([t,t1],axis=1)
         t=t.stack().to_frame().T
        ## t.columns = t.iloc[0]
         t.columns = ['{}_{}'.format(*c) for c in t.columns]
     t['id'] = df_user['id']
     output = pd.concat([output,t],ignore_index=True)

I do know that dict() will speed this up considerably aswell.

Sade
  • 450
  • 7
  • 27

1 Answers1

1

You can use pivot_table function of pandas. You can learn all about it here. https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.pivot_table.html

For your dataset you can use the following code.

df = df.pivot_table(['No_Units','No_Purchases','item'],'id','item_ind',fill_value=0) 
df.columns =[s1 + str(s2) for (s1,s2) in df.columns.tolist()] 
df.reset_index(inplace=True) 

It will fill columns with na if there are no purchases. You can use df.fillna(0) to change them to 0.

Shyam Nair
  • 48
  • 5
  • This is not the expected dataframe of OP. – igorkf Mar 04 '21 at 13:54
  • The expected output is not the same. – Sade Mar 04 '21 at 14:20
  • I have modified your response to ensure the output is the same : df = pd.DataFrame({ 'id': [1, 1, 2, 3], 'item_ind': [140, 160,160, 140], 'item': [140, 160,160, 140], 'No_Units': [46, 16, 46, 16], 'No_Purchases': [1,1,2,1]}) df = df.pivot_table(['No_Units','No_Purchases','item'],'id','item_ind',fill_value=0) df.columns =[s1 + str(s2) for (s1,s2) in df.columns.tolist()] df.reset_index(inplace=True) – Sade Mar 04 '21 at 14:43
  • Can you please update your response so that the code produces the correct output? Kindly refer to the code above. I have accepted your answer. – Sade Mar 04 '21 at 14:50
  • Yes sorry, I have updated my answer. – Shyam Nair Mar 04 '21 at 16:01
  • Thanks for sharing this knowledge with me. I have not used pivots prior and it fast. – Sade Mar 04 '21 at 16:04
  • df['item_ind'] = df['item'] – Sade Mar 04 '21 at 16:05