0

I'm having a dataset which as the following

customer    products    Sales
1           a           10
1           a           10
2           b           20
3           c           30

How can I reshape and to do that in python and pandas? I've tried with the pivot tools but since I have duplicated CUSTOMER ID it's not working...

Products                
customerID  a   b   c   
1           10          
1           10          
2              20       
3                   30  



{' update': {209: 'Originator',
 211: 'Originator', 
212: 'Originator',
 213: 'Originator',
 214: 'Originator'}, 
'CUSTOMER ID': {209: 1000368,
 211: 1000368, 212: 1000968, 213: 1000968, 214: 1000968},
 'NET SALES VALUE SANOFI':{209: 426881.0,
 211: 332103.0, 212: 882666.0, 213: 882666.0, 214: 294222.0},
 'PRODUCT FAMILY': {209: 'APROVEL',
 211: 'APROVEL', 212: 'APROVEL', 213: 'APROVEL', 214: 'APROVEL'}, 
'CHANNEL DEFINITION':
 {209: 'PHARMACY', 211: 'PHARMACY', 212: 'PHARMACY', 213: 'PHARMACY', 214: 'PHARMACY'}, 
'index': {209: 209, 211: 211, 212: 212, 213: 213, 214: 214}


CUSTOMER ID 1228675 non-null int64 
DISTRIBUTOR ID 1228675 non-null float64 
PRODUCT FAMILY 1228675 non-null 
object GROSS SALES QUANTITY 1228675 
non-null int64 GROSS SALES VALUE 1228675 
non-null int64 NET SALES VALUE 1228675 
non-null int64 DISCOUNT VALUES 1228675 
non-null int64 CHANNEL DEFINITION 1228675 non-null object

what i tried also : ONLY_PHARMA.pivot_table(values = "NET SALES VALUE ", index = ["CUSTOMER ID"], columns = "PRODUCT FAMILY").reset_index()

what im getting now a mix of float and Int....?? Why?

ID        A           B          C
1000167  NaN   2.380122e+05      244767.466667

or im having : 

ValueError: negative dimensions are not allowed

OR I've done which also return me floats and int:

pvt = pd.pivot_table(ONLY_PHARMA.reset_index(), index=['CUSTOMER ID'],
                              columns='PRODUCT FAMILY', values='NET SALES VALUE' , fill_value='') \
               .reset_index()
SIMON_py
  • 1
  • 4

4 Answers4

2

You can use cumcount with set_index + unstack for reshape:

g = df.groupby(['customer', 'products']).cumcount()
df = (
      df.set_index([g, 'customer', 'products'])['Sales']
       .unstack().sort_index(level=1)
       .reset_index(level=0, drop=True)
      )
print (df)
products     a     b     c
customer                  
1         10.0   NaN   NaN
1         10.0   NaN   NaN
2          NaN  20.0   NaN
3          NaN   NaN  30.0

Notice:

If duplicated values, maybe need aggregation, check how to pivot a dataframe

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Something like: `df.customer.to_frame(name='customer_id').join(df.pivot(columns='products', values='Sales')).fillna('').set_index('customer_id')` might do it – Jon Clements Feb 24 '18 at 17:19
  • Key point is to start with `df.pivot(columns='products', values='Sales')` I think... Heck - even `df.join(df.pivot(columns='products', values='Sales'))` might do... – Jon Clements Feb 24 '18 at 17:24
  • Hi John thanks for your help ! `df.pivot(columns='products', values='Sales') and df.join(df.pivot(columns='products', values='Sales') `are not working.... About this one : `df.customer.to_frame(name='customer_id').join(df.pivot(columns='products', values='Sales')).fillna('').set_index('customer_id')`... i dont understand the df.customer at the beginning? – SIMON_py Feb 24 '18 at 17:50
  • ` Int64Index: 5 entries, 1000368 to 1000968 Data columns (total 1 columns): APROVEL 5 non-null int64 dtypes: int64(1) memory usage: 80.0 bytes` Only returns me one product..? – SIMON_py Feb 24 '18 at 18:50
  • @SIMON_py - I see your last edit and I am confused what is expected output. Can you add it to question? Or wht is problem? can you explain more? – jezrael Feb 24 '18 at 18:51
  • @jezrael IF i use your – SIMON_py Feb 24 '18 at 19:08
  • @jezrael it shows the ids but only one product for some reason... I need to see the values for all the products... Let me explain... Mydataset: I have a columns of ID, one of total sales (42.32), one of the type of product (a,b,c)... Im wondering if the format of my data is causing me problems also? see my edits... – SIMON_py Feb 24 '18 at 19:13
  • @SIMON_py - Your columns are `['Ambar update', 'CUSTOMER ID', 'NET SALES VALUE SANOFI', 'PRODUCT FAMILY', 'SANOFI CHANNEL DEFINITION', 'index']`. What is column `product` ? – jezrael Feb 24 '18 at 19:17
  • @jezrael Also i dont want to group by the same ID to sum... If im grouping my values and doing on pivot its working properly.. But for some reasons it seems that the duplicate ID is the problem. Its product family – SIMON_py Feb 24 '18 at 19:17
  • @SIMON_py - thanks. I think reason why you get only one columns is clear - in column `PRODUCT FAMILY` is always only same value `APROVEL`. so `APROVEL` create new column in new DataFrame. What is `print (df['PRODUCT FAMILY'].eq('APROVEL').sum())` of original `DataFrame` ? – jezrael Feb 24 '18 at 19:26
  • Not sure to understand.. I dont want to sum or mean my products... I want all duplicates ID on a seperate line... – SIMON_py Feb 24 '18 at 21:21
  • @jezrael i want to do exactly this https://codedump.io/share/geAjSS6EcG2l/1/pandas-transposing-one-column-in-a-multiple-column-df . But i dont want to average or count. I want to keep all the rows of my dataframe – SIMON_py Feb 24 '18 at 21:52
2

Here's fairly straight-forward way assuming you have a unique index, given your input of:

   customer products  Sales
0         1        a     10
1         1        a     10
2         2        b     20
3         3        c     30

Pivot it to columnise the products and rejoin to just the customer column on the original frame, eg:

new_df = df[['customer']].join(df.pivot(columns='products', values='Sales'))

This'll give you:

   customer     a     b     c
0         1  10.0   NaN   NaN
1         1  10.0   NaN   NaN
2         2   NaN  20.0   NaN
3         3   NaN   NaN  30.0

Then sort out your indexing / filling blank values.

Jon Clements
  • 138,671
  • 33
  • 247
  • 280
  • Nice1! A small modification suggestion: output of `df.pivot(columns='products', values='Sales').set_index(df['customer'])` ressembles more what OP wanted. – Anton vBR Feb 24 '18 at 17:50
  • 1
    @AntonvBR yeah... wasn't *quite* sure exactly what the output was supposed to be - I'm guessing the OP does so they can just index appropriately at the relevant stage... – Jon Clements Feb 24 '18 at 17:51
  • For some reason in using this and `df.pivot(columns='products', values='Sales').set_index(df['customer'])` and Im getting only one column of one products... i should get in columns all my 50 products... – SIMON_py Feb 24 '18 at 18:19
1

Another method using str.get_dummies.

pd.concat([df, df.products.str.get_dummies().multiply(df["Sales"], axis="index")], 
          axis=1)


   customer products    Sales   a   b   c
0   1       a           10      10  0   0
1   1       a           10      10  0   0
2   2       b           20      0   20  0
3   3       c           30      0   0   30

df.products.str.get_dummies() creates dummy variables as follows

    a   b   c
0   1   0   0
1   1   0   0
2   0   1   0
3   0   0   1

We then need to multiply this dummy variable table with df["Sales"]. This is achieved by df.products.str.get_dummies().multiply(df["Sales"], axis="index") (See reference for more information.)

    a   b   c
0   10  0   0
1   10  0   0
2   0   20  0
3   0   0   30

Reference

how to multiply multiple columns by a column in Pandas

Note: to replace 0 with np.nan, you need to add .replace(0, np.nan) like

pd.concat([df, df.products.str.get_dummies().replace(0, np.nan).mul(df["Sales"], axis="index")], axis=1)

Tai
  • 7,684
  • 3
  • 29
  • 49
  • Im having invalid syntax `pd.concat([df, df.products` , Im also already in a pandas dataframe – SIMON_py Feb 24 '18 at 18:34
  • @SIMON_py can you post your whole line code? I cannot see where it went wrong. – Tai Feb 25 '18 at 03:19
  • Im trying also : `pvt = pd.pivot_table(ONLY_PHARMA.reset_index(), index=['index2'], columns='PRODUCT FAMILY', values='NET SALES VALUE SANOFI', fill_value='') \ .reset_index()` which returns in rows int and float for some reasons... I've create a new index in my dataframe but seems pandas dont see this unique Index for each rows – SIMON_py Feb 25 '18 at 04:41
  • Sorry, my answer does not involve `pivot`... @SIMON_py – Tai Feb 25 '18 at 04:43
  • `pd.concat([df, df.PRODUCT FAMILY.str.get_dummies().multiply(df["NET SALES VALUE "], axis="CUSTOMER ID")], axis=1)` Error Invalid Syntax – SIMON_py Feb 25 '18 at 04:53
  • replace `df.PRODUCT FAMILY` with `df["PRODUCT FAMILY"]` @SIMON_py – Tai Feb 25 '18 at 04:56
  • `pd.concat([df, df["PRODUCT FAMILY"].str.get_dummies().multiply(df["NET SALES VALUE"], axis="CUSTOMER ID")], axis=1) ` `ValueError: No axis named CUSTOMER ID for object type ` – SIMON_py Feb 25 '18 at 05:01
  • @SIMON_py just `axis="index"` Not `axis="CUSTOMER ID"` – Tai Feb 25 '18 at 05:09
0

Your question is unclear. In case of duplicate key, we usually aggregate values. Is that what you want ? Try this:

df.pivot_table(index='customer', columns='products', values ='Sales', aggfunc='sum')

products customer   a    b    c
   0         1    20.0  NaN  NaN
   1         2    NaN   20.0 NaN
   2         3    NaN   NaN 30.0
YOLO
  • 20,181
  • 5
  • 20
  • 40
  • Hi Manish, If im using the aggfunc('sum') it will make the sum of customer ID 1 which i don't want. I want to keep all my rows – SIMON_py Feb 24 '18 at 17:21
  • Yes, I realized that later. @jezrael answer seems correct. – YOLO Feb 24 '18 at 17:32