0

I have the two dataframes that have one column in common: users & apps. Each user can have multiple apps and each app will have name and installation date.

After joining both dataframes I get:

user_id ...  app_name  ins_date
user1   ...   app1      1/1/2020
user1   ...   app2      1/7/2020
user2   ...   NULL      NULL
user3   ...   app1      1/5/2020
user4   ...   app1      1/3/2020

I want to avoid repeating users but without losing the app details. Is it possible to insert the app's df inside the user's df?

Something like:

user_id ...  app
user1   ...   {'app_name': ['app1', 'app2'], 'ins_date': ['1/1/2020', '1/7/2020']}
user2   ...   NULL      
user3   ...   {'app_name': ['app1'], 'ins_date': ['1/5/2020']}
user4   ...   {'app_name': ['app1'], 'ins_date': ['1/3/2020']}  

I tried DataFrameGroupBy.apply() but I can only make a list with one of the columns.

Is there another way?

Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
  • Please include the raw data for each dataframe and the code used to generate your combined dataframe. Please [provide a reproducible copy of the DataFrame with `to_clipboard`](https://stackoverflow.com/questions/52413246/how-do-i-provide-a-reproducible-copy-of-my-existing-dataframe) – Trenton McKinney Oct 04 '19 at 01:50

1 Answers1

0

You can try this

df.groupby('user_id').apply(lambda x: {'app_name': x['app_name'].tolist(), 'ins_date': x['ins_date'].tolist()})

output

user_id
user1    {'app_name': ['app1', 'app2'], 'ins_date': ['1/1/2020', '1/7/2020']}
user2    {'app_name': [nan], 'ins_date': [nan]}                              
user3    {'app_name': ['app1'], 'ins_date': ['1/5/2020']}                    
user4    {'app_name': ['app1'], 'ins_date': ['1/3/2020']}                    
dtype: object

Update: used pd.set_option('display.max_colwidth', -1) to display max col width