How to join two DataFrames with multiple overlapping timestamps using an extra shared variable

Question

This question builds on the one asked here: How to join two dataframes for which column values are within a certain range? and answered by @coldspeed . Below is the DataFrame modified for my problem:

print df_1

  timestamp              A          B       User
0 2016-05-14 10:00    0.020228   0.026572    1
1 2016-05-14 10:00    0.057780   0.175499    2
2 2016-05-14 10:00    0.098808   0.620986    3
3 2016-05-14 10:15    0.158789   1.014819    1
4 2016-05-14 10:15    0.038129   2.384590    2
5 2016-05-14 10:15    0.038129   2.384590    3

print df_2

  start                end                  event   User  
0 2016-05-14 10:00     2016-05-14 10:54:33  E1       1        
1 2016-05-14 10:00     2016-05-14 10:54:37  E2       2
2 2016-05-14 10:00     2016-05-14 10:54:42  E3       3

desired output:

  timestamp              A          B       User  event
0 2016-05-14 10:00    0.020228   0.026572    1     E1
1 2016-05-14 10:00    0.057780   0.175499    2     E2
2 2016-05-14 10:00    0.098808   0.620986    3     E3
3 2016-05-14 10:15    0.158789   1.014819    1     E1
4 2016-05-14 10:15    0.038129   2.384590    2     E2
5 2016-05-14 10:15    0.038129   2.384590    3     E3

So, I believe I can use as a base:

idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
event = df_2.loc[idx.get_indexer(df_1.timestamp), 'event']
df_1['event'] = event.values

But I need a way to reference the UserID to prevent mixing up the sessions that overlap.

The dataset is huge and the same user participates in many sessions, so I don't think Join would work here? — thepez87, Oct 31 '18 at 14:03

score 0 · Answer 1 · answered Oct 31 '18 at 14:05

0

In that case you can using merge_asof

pd.merge_asof(df1,df2,left_on='timestamp',right_on='end',by='User',direction ='forward')
Out[148]: 
            timestamp         A  ...                   end  event
0 2016-05-14 10:00:00  0.020228  ...   2016-05-14 10:54:33     E1
1 2016-05-14 10:00:00  0.057780  ...   2016-05-14 10:54:37     E2
2 2016-05-14 10:00:00  0.098808  ...   2016-05-14 10:54:42     E3
3 2016-05-14 10:15:00  0.158789  ...   2016-05-14 10:54:33     E1
4 2016-05-14 10:15:00  0.038129  ...   2016-05-14 10:54:37     E2
5 2016-05-14 10:15:00  0.038129  ...   2016-05-14 10:54:42     E3

answered Oct 31 '18 at 14:05

BENY

317,841
20
164
234

Ok, so this works well if every line in df2 has corresponding timestamps. I am finding now in my df2 that there are extra sessions, i.e. ones that do not correspond to any timestamps from df1. Is there a work around for this? – thepez87 Oct 31 '18 at 14:49
@thepez87 then it will be O(m.n) solution – BENY Oct 31 '18 at 14:52
@thepez87 df=df1.merge(df2,how='left');df.loc[(df.timestamp>df.start)&(df.timestamp – BENY Oct 31 '18 at 14:53
Sorry what is O(m.n) ? Can you explain your answer a bit more please? And what happened to merging by User? – thepez87 Oct 31 '18 at 14:58

How to join two DataFrames with multiple overlapping timestamps using an extra shared variable

1 Answers1

Linked