
For my current project I need to merge some dataframes based on time intervals. There are of course plenty of answers to this question; for example, I previously used the answer from this question and it works great, but it involves nested for-loops.

My current issue is this. I have a series of ~500 dataframes similar to df_1 here:

                  time     value
42 2020-07-01 00:57:16  0.356310
43 2020-07-01 01:00:20  0.472270
44 2020-07-01 03:33:40  0.356310
45 2020-07-01 03:45:37  0.356310
46 2020-07-01 03:57:51  0.425053
47 2020-07-01 05:29:20  0.340471

and they need to have an event column joined from df_2:

                 start                 end  event
49 2020-07-01 03:07:16 2020-07-01 03:09:51    1.0
50 2020-07-01 03:09:51 2020-07-01 03:12:30    1.0
54 2020-07-01 03:21:21 2020-07-01 03:24:46    1.0
58 2020-07-01 03:33:40 2020-07-01 03:36:43    1.0
59 2020-07-01 03:36:43 2020-07-01 03:39:33    1.0
60 2020-07-01 03:39:33 2020-07-01 03:42:57    1.0

whenever df_1.time falls between df_2.start and df_2.end, so that in the end I get:

                  time     value      event
42 2020-07-01 00:57:16  0.356310      NaN
43 2020-07-01 01:00:20  0.472270      NaN
44 2020-07-01 03:33:40  0.356310      1.0
45 2020-07-01 03:45:37  0.356310      NaN
46 2020-07-01 03:57:51  0.425053      NaN
47 2020-07-01 05:29:20  0.340471      NaN

The problem is that, as I said, I have ~500 dataframes similar to df_1 with over 10k records each, and df_2 itself is ~40000 rows long. So looping over all of the rows is out of the question; it would take an exorbitant amount of time.

But for the life of me I can't find any vectorized method in Pandas that can be used for that purpose. Any ideas?
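For reproducibility, the example frames above can be reconstructed like this (using the names df1 and df2 that the answers below use):

```python
import pandas as pd

# rebuild df_1 from the printed sample (index values as shown)
df1 = pd.DataFrame({
    'time': pd.to_datetime([
        '2020-07-01 00:57:16', '2020-07-01 01:00:20', '2020-07-01 03:33:40',
        '2020-07-01 03:45:37', '2020-07-01 03:57:51', '2020-07-01 05:29:20',
    ]),
    'value': [0.356310, 0.472270, 0.356310, 0.356310, 0.425053, 0.340471],
}, index=[42, 43, 44, 45, 46, 47])

# rebuild df_2 from the printed sample
df2 = pd.DataFrame({
    'start': pd.to_datetime([
        '2020-07-01 03:07:16', '2020-07-01 03:09:51', '2020-07-01 03:21:21',
        '2020-07-01 03:33:40', '2020-07-01 03:36:43', '2020-07-01 03:39:33',
    ]),
    'end': pd.to_datetime([
        '2020-07-01 03:09:51', '2020-07-01 03:12:30', '2020-07-01 03:24:46',
        '2020-07-01 03:36:43', '2020-07-01 03:39:33', '2020-07-01 03:42:57',
    ]),
    'event': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
}, index=[49, 50, 54, 58, 59, 60])
```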

NotAName

2 Answers


This is a good case for merge_asof, which only works if all the start/end intervals are non-overlapping (and both frames are sorted on their join keys).

# match each `time` to the largest `start` earlier than (or equal to) it
tmp = pd.merge_asof(df1, df2, left_on='time', right_on='start')

# keep `event` only where `time` <= `end`
df1['event'] = tmp['event'].where(tmp['end'] >= tmp['time']).values

Output:

                  time     value  event
42 2020-07-01 00:57:16  0.356310    NaN
43 2020-07-01 01:00:20  0.472270    NaN
44 2020-07-01 03:33:40  0.356310    1.0
45 2020-07-01 03:45:37  0.356310    NaN
46 2020-07-01 03:57:51  0.425053    NaN
47 2020-07-01 05:29:20  0.340471    NaN
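Since there are ~500 frames like df1, it can help to wrap the above in a function and apply it to each one. A sketch (the name add_event and the frames list are mine, not from the question):

```python
import pandas as pd

def add_event(df, events):
    """Attach events['event'] to rows of df whose time falls in [start, end].

    Both inputs must already be sorted on `time` / `start`, and the
    intervals must be non-overlapping, as merge_asof requires.
    """
    # match each `time` to the latest `start` <= time
    tmp = pd.merge_asof(df, events, left_on='time', right_on='start')
    out = df.copy()
    # keep the event only if `time` also falls before the interval's `end`
    out['event'] = tmp['event'].where(tmp['end'] >= tmp['time']).to_numpy()
    return out

# results = [add_event(df, df2) for df in frames]   # frames: your ~500 dataframes
```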

For the general case (overlapping intervals), the only thing I can think of is a cross-merge and filter:

(df1.assign(key=1)
    .merge(df2.assign(key=1), on='key')
    .query('start <= time <= end')
)

which should be doable with your data sizes, albeit a bit slow.
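Note that the cross-merge-and-filter keeps only the matching rows; to get NaN on the unmatched rows as in the desired output, you can map the matches back onto df1's index. A sketch (the function name is mine; it assumes df1 has a unique, unnamed index and that each time matches at most one interval):

```python
import pandas as pd

def attach_event(df1, df2):
    matched = (
        df1.reset_index()               # keep df1's index as a column
           .assign(key=1)
           .merge(df2.assign(key=1), on='key')
           .query('start <= time <= end')
           .set_index('index')          # back to df1's original labels
    )
    out = df1.copy()
    # rows with no matching interval get NaN
    out['event'] = matched['event'].reindex(df1.index)
    return out
```

On pandas >= 1.2 the `assign(key=1)` trick can be replaced with `df1.merge(df2, how='cross')`.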

Quang Hoang

One option is conditional_join from pyjanitor, which also covers overlapping intervals:

# pip install pyjanitor
import pandas as pd
import janitor

(df1
 .conditional_join(
     df2,
     ('time', 'start', '>='),
     ('time', 'end', '<='),
     how='left')
 .loc[:, ['time', 'value', 'event']]
)

                 time     value  event
0 2020-07-01 00:57:16  0.356310    NaN
1 2020-07-01 01:00:20  0.472270    NaN
2 2020-07-01 03:33:40  0.356310    1.0
3 2020-07-01 03:45:37  0.356310    NaN
4 2020-07-01 03:57:51  0.425053    NaN
5 2020-07-01 05:29:20  0.340471    NaN

sammywemmy