2

let say i have a dataframe that consist of these two columns.

User_id hotel_cluster 
   1     0
   2     2
   3     2
   3     3 
   3     0
   4     2

i want to change it into something like this. Do i need to write a function or is there a pandas way to do it?

User_id hotel_cluster_0 hotel_cluster_1 hotel_cluster_2 hotel_cluster_3
  1          1                  0             0              0
  2          0                  0             1              0
  3          1                  0             1              1
  4          0                  0             1              0

Please help! Sorry if i am not posting the question in the right format Thank you!

yatu
  • 86,083
  • 12
  • 84
  • 139
Simon Lim
  • 55
  • 3

2 Answers2

2

SEE ALSO


IIUC:

Option 1

First change 'hotel_cluster' to a categorical that includes categories that don't exist

col = 'hotel_cluster'
df[col] = pd.Categorical(df[col], categories=[0, 1, 2, 3])
pd.crosstab(*map(df.get, df)).add_prefix(f"{col}_")

hotel_cluster  hotel_cluster_0  hotel_cluster_1  hotel_cluster_2  hotel_cluster_3
User_id                                                                          
1                            1                0                0                0
2                            0                0                1                0
3                            1                0                1                1
4                            0                0                1                0

Option 2

Reindex after crosstab

pd.crosstab(*map(df.get, df)).reindex(
    columns=range(4), fill_value=0
).add_prefix('hotel_cluster_')

hotel_cluster  hotel_cluster_0  hotel_cluster_1  hotel_cluster_2  hotel_cluster_3
User_id                                                                          
1                            1                0                0                0
2                            0                0                1                0
3                            1                0                1                1
4                            0                0                1                0
piRSquared
  • 285,575
  • 57
  • 475
  • 624
1

A simple way if you do not need the non-appearing values is to use pd.get_dummies:

pd.get_dummies(df.hotel_cluster, prefix = 'hotel_cluster').set_index(df.User_id)

Otherwise you want something like @piRSquared's solution.

yatu
  • 86,083
  • 12
  • 84
  • 139