0

Let say I have a dataframe df1 with columns are like [user_id: string, app_name: string, frequency: float]. Now I want to get a df2 matrix representation Users x Apps, where each entry indicates the frequency. In other words, df2 would have columns that look like [user_id, frequency_app_1, frequency_app_2, ... frequency_app_n] Also, how could I handle the data when n<1000 and n>10000?

Since I don't know the keyword for this problem, a clue would also be helpful.

Let say dataset is like below

  user_id app_name  frequency
    id001   app4   324.54
    id005   app4   786.56
    id002   app3   87.79
    id002   app4   423.76
    id003   app1   23.34
    id001   app2   34.76
    id001   app1   796.89    
    id004   app6   324.12
    id002   app1   32.34
    id004   app3   124.98
    id003   app1   324.98
    id003   app1   434.5
    id001   app1   89.54
    id001   app4   54.51
    id002   app2   544.87
    id003   app3   76.0
    id002   app1   32.32
    id001   app1   654.56

The expected outcome should look like

user_id   frequency_app_1   frequency_app_2   frequency_app_3   frequency_app_4   
id001     1629.1                 34.76              N/A                 379.05
...

1 Answers1

0

I think your example has a little problem. there are some columns has the same "user_id" and "app_name", but different in "frequency". So you can't use pivot directly, you can drop_duplicates and then use pivot.

df2 = df.drop_duplicates(subset=["user_id", "app_name"])
df2.pivot(index="user_id", columns="app_name", values="frequency")
# Output
app_name    app1    app2    app3    app4    app6
user_id                 
id001   796.89  34.76   NaN 324.54  NaN
id002   32.34   544.87  87.79   423.76  NaN
id003   23.34   NaN 76.00   NaN NaN
id004   NaN NaN 124.98  NaN 324.12
id005   NaN NaN NaN 786.56  NaN

But make sure what columns you want if the "user_id" and "app_name" is the same.

Hsgao
  • 553
  • 5
  • 18