1

I have a dataset of black Friday sales. The columns are User_ID, Product_ID, Gender, Occupation, Product_Category, Purchase, Marital_Status, etc. After analyzing the data, I found that the attribute User_ID has redundant entries (i.e. Single customer buying multiple goods). The total number of entries is 537377 and after I apply df = df.groupby('User_ID'), the number of entries is reduced to 5891. I want to extract all the unique rows (i.e. a unique row per customer)from the pandas groupby object. Is there any way to do so?

I tried summing up each purchase amount corresponding to unique User_ID, but that does not help.

df = df.groupby('User_ID')
df['Purchase'].transform('sum')
for key, item in df:
    print(df.get_group(key), "\n\n")

After executing the above code, the result I get is:

        User_ID Gender   Age  Occupation City_Category  \ 
0       1000001      F  0-17          10             A   
1       1000001      F  0-17          10             A   
2       1000001      F  0-17          10             A   
3       1000001      F  0-17          10             A   
39180   1000001      F  0-17          10             A 
4       1000002      M  55+           16             C   
39181   1000002      M  55+           16             C   
39182   1000002      M  55+           16             C   
39183   1000002      M  55+           16             C   
39184   1000002      M  55+           16             C   
78147   1000002      M  55+           16             C 

        Product_Category_2  Product_Category_3  Purchase  
0                      0.0                 0.0      8370  
1                      6.0                14.0     15200  
2                      0.0                 0.0      1422  
3                     14.0                 0.0      1057  
39180                  4.0                 8.0     12842 
4                      0.0                 0.0      7969  
39181                 17.0                 0.0      6187  
39182                 16.0                 0.0     10074  
39183                  8.0                14.0      5260  
39184                 16.0                 0.0      7927  
78147                 16.0                 0.0      7791  

What I actually want is after dropping product_category_2, 3 is with attribute purchase containing the sum of the total money spent

        User_ID Gender   Age  Occupation City_Category  Purchase
0       1000001      F  0-17          10             A     38891
1       1000002      M  55+           16             C     37239
  • 1
    that is just groupby sum , check df.groupby(['User_ID', 'Gender', 'Age', 'Occupation', 'City_Category']).Purchase.sum() – BENY May 20 '19 at 04:27
  • Thank you very much. That worked fine for me. – Sahil Nathani May 20 '19 at 04:32
  • you can use sort and drop_duplicate for performance (faster) like this ` df1 =df.sort_values(by=['User_ID']).drop_duplicates('User_ID', keep='first').sum() ` – Omer Anisfeld May 20 '19 at 07:09
  • use this answer https://stackoverflow.com/questions/55932560/pandas-core-groupby-dataframegroupby-idxmin-is-very-slow-how-can-i-make-my-c – Omer Anisfeld May 20 '19 at 07:26

0 Answers0