2

I have data in the shape as follows:


pd.DataFrame({'id': [1,2,3], 'item': ['item_a', 'item_a', 'item_b'],
                  'score': [1,-1,1]})

id  item    score
1   item_a    1
2   item_a    -1
3   item_b    1

I want to get dummy codes for the item column, but I want them scored based on their values in the score column. If there are no observations, I want a 0 imputed. Like so:

id  item_a  item_b
1      1      0
2     -1      0
3      0      1

As you see, I want to capture that user id 1 liked item_a, that id 2 disliked item_a, and that user 3 did not interact with item_a. The id column is not unique per row - for example, user id 3 could have liked item_a, and that would be recorded as a new row in the original dataframe.

I've tried using get_dummies in pandas, but that method only calculates the number of observed values in the "item" column, it doesn't take into account the score values.

Daniel
  • 363
  • 3
  • 11

2 Answers2

3

What you want, is a pivot_table:

df.pivot_table(values='score', index='id', columns='item', fill_value=0)
#item  item_a  item_b
#id                  
#1          1       0
#2         -1       0
#3          0       1
Jondiedoop
  • 3,303
  • 9
  • 24
1

You could use set_index + unstack:

import pandas as pd

df = pd.DataFrame({'id': [1,2,3], 'item': ['item_a', 'item_a', 'item_b'],
                  'score': [1,-1,1]})

print(df.set_index(['id', 'item'])['score'].unstack(fill_value=0))

Output

item  item_a  item_b
id                  
1          1       0
2         -1       0
3          0       1

If you want to exactly match your output just add this lines:

result.columns.name = None
result.reset_index(inplace=True)

Output

   id  item_a  item_b
0   1       1       0
1   2      -1       0
2   3       0       1
Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76