1

I want to transform a one-hot-encoded dataframe into a basket sparse matrix.

I have this:

df
Ticket Number  Water  Orange  Lemon  Strawb.  Peach  Book  Pen
5001           0      0      0     0       1     1    0
5002           1      1      0     0       1     1    0
5003           1      0      0     0       0     0    0

I want this:

df
Ticket Number 
5001           Peach, Book
5002           Water, Orange, Peach, Book
5003           Water

I have tried some of the ideas here:

Pivoting a One-Hot-Encode Dataframe

But I wasn't able to come up with a solution myself.

Some help would be very much appreciated. Thanks

This is how the dataframe actually looks like

Marc
  • 129
  • 1
  • 8

2 Answers2

2

You can use DataFrame.dot after setting "Ticket Number" to be the index:

u = df.set_index('Ticket Number')
u.dot(u.columns+',').str.rstrip(',')

Ticket Number
5001                 Peach,Book
5002    Water,Orange,Peach,Book
5003                      Water
dtype: object

Or,

u.dot(u.columns+',').str[:-1].reset_index(name='Items')

   Ticket Number                    Items
0           5001               Peach,Book
1           5002  Water,Orange,Peach,Book
2           5003                    Water

A slightly more robust version of the same thing:

u = df.set_index('Ticket Number').select_dtypes([np.number])
u = u.fillna(0, downcast='infer').clip(lower=0, upper=1)
u.dot(u.columns+',').str[:-1]

Ticket Number
5001                 Peach,Book
5002    Water,Orange,Peach,Book
5003                      Water
dtype: object
cs95
  • 379,657
  • 97
  • 704
  • 746
  • 2
    `df.dot(df.columns+',').str[:-1] ` – BENY May 23 '19 at 14:48
  • For some reason i thought `.dot()` only worked with `bool` :( – anky May 23 '19 at 14:52
  • 1
    @anky_91 It's work as long as its 0/1 in any form. Dot will "multiply" the string with the number. s * 0 is '' and s * 1 is s. That's the trick here. – cs95 May 23 '19 at 14:53
  • Hi @WeNYoBen thanks for your effort. I am encountering this error: ` TypeError: can't multiply sequence by non-int of type 'float'` – Marc May 23 '19 at 15:17
  • @Marc After `u = df.set_index('Ticket Number')`, do `u = u.fillna(0, downcast='infer')` and try the `dot` step again. – cs95 May 23 '19 at 15:19
  • Hi @cs95 thank you very much for your help. I am still getting the same error. I don't understand why it isn't working! – Marc May 23 '19 at 15:28
  • @Marc Me neither... unless you can show me some data sample where this code is failing :) – cs95 May 23 '19 at 15:29
  • @cs95 How can I share that with you? – Marc May 23 '19 at 15:30
  • @Marc Just [modify your post](https://stackoverflow.com/posts/56277750/edit) to include 2-3 more rows with new data. Make sure you can reproduce the issue on the sample provided. Thanks! – cs95 May 23 '19 at 15:31
  • @cs95 As suggested, I have included a screenshot of the actual dataframe before doing any modifications suggested here. I hope you can help and thanks for your time :) – Marc May 23 '19 at 15:34
  • @Marc Umm, okay, I've edited my answer and thrown the kitchen sink at it. If it doesn't work, don't forget to save your file. It'll work after that. – cs95 May 23 '19 at 15:38
  • 1
    @cs95 Thanks for your time and help. I had some items that had values greater than 1, hence it didn't work. I appreciate your time and effort! Thanks a lot – Marc May 23 '19 at 15:54
2

A long way

df.melt('TicketNumber').loc[lambda x :x['value']==1].groupby('TicketNumber').variable.agg(','.join)
Out[746]: 
TicketNumber
5001                 Peach,Book
5002    Water,Orange,Peach,Book
5003                      Water
Name: variable, dtype: object
BENY
  • 317,841
  • 20
  • 164
  • 234