I have two dataframes. The first one, df_1, holds state data in the following format:

session_id | user_id | timestamp | state1 | state2

The second one, df_2, holds other states in this format:

session_id | user_id | timestamp | state3 | state4

I want to join them into a final dataframe with the following structure:

session_id | user_id | timestamp | state1 | state2 | state3 | state4

I want to keep the rows from both frames, not just the ones that overlap. Something like this:

   session_id  user_id timestamp state1 state2 state3 state4
0           1        1     14:00      1      1      0      1
1           1        2     14:00    NaN    NaN      0      1
2           1        3     14:00      1      0    NaN    NaN

I think this should be an outer join, so I studied the documentation and came up with this:

df_1.set_index(['session_id', 'user_id', 'timestamp']).join(
   df_2.set_index(['session_id', 'user_id', 'timestamp']), how="outer")

But it does not work the way I expect. It just puts the dataframes one under the other. What should I do?

Michal Hucko

1 Answer

You can simply perform an outer merge, as below.

When no on argument is given, pd.merge joins on the columns the two dataframes have in common, here session_id, user_id and timestamp.

Setup

import pandas as pd

df1 = pd.DataFrame({'session_id': [1, 1],
                    'state1': ['1', '1'],
                    'state2': ['1', '0'],
                    'timestamp': ['14:00', '14:00'],
                    'user_id': [1, 3]})

df2 = pd.DataFrame({'session_id': [1],
                    'state3': ['0'],
                    'state4': ['1'],
                    'timestamp': ['14:00'],
                    'user_id': [2]})

Solution

res = pd.merge(df1, df2, how='outer')

Result

   session_id  user_id timestamp state1 state2 state3 state4
0           1        1     14:00      1      1    NaN    NaN
1           1        3     14:00      1      0    NaN    NaN
2           1        2     14:00    NaN    NaN      0      1
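
If you prefer to be explicit about the join keys, a minimal sketch along the following lines may help. The sample data here is made up for illustration (it is not the data from the question): user 1 appears in both frames, so that one row gets all four state columns filled in. The indicator=True option adds a _merge column showing where each row came from.

```python
import pandas as pd

# Illustrative data only: user 1 is in both frames,
# user 3 only in df1, user 2 only in df2.
df1 = pd.DataFrame({'session_id': [1, 1],
                    'user_id': [1, 3],
                    'timestamp': ['14:00', '14:00'],
                    'state1': ['1', '1'],
                    'state2': ['1', '0']})

df2 = pd.DataFrame({'session_id': [1, 1],
                    'user_id': [1, 2],
                    'timestamp': ['14:00', '14:00'],
                    'state3': ['0', '0'],
                    'state4': ['1', '1']})

# Name the key columns explicitly instead of relying on the default.
keys = ['session_id', 'user_id', 'timestamp']

# indicator=True adds a '_merge' column with the values
# 'both', 'left_only' or 'right_only' for each row.
res = pd.merge(df1, df2, on=keys, how='outer', indicator=True)
print(res)
```

Rows whose keys match in both frames are combined into a single row, and the others are kept with NaN in the missing state columns. If the rows still do not combine on real data, it may be worth checking that the key columns have the same dtype and exactly the same values in both frames.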
jpp