Given the following two dataframes that represent ranges:
df1 =
start end
0 200 300
1 600 900
2 950 1050
df2 =
start end
0 350 550
1 650 800
2 900 1100
They can be represented as such:
df1 [200 300] [600 900] [950 1050]
df2 [350 550] [650 800] [900 1100]
I'm tasked with identifying four different types of relationships between df1
and df2
ranges:
df2
subset ofdf1
df2 [650 800]
subset ofdf1 [600 900]
df2
superset ofdf1
df2 [900 1100]
superset ofdf1 [950 1050]
df2
afterdf1
(nearest neighbor, excluding subset/superset)df2 [350 550]
afterdf1 [200 300]
df2 [900 1100]
afterdf1 [600 900]
df2
beforedf1
(nearest neighbor, excluding subset/superset)df2 [350 550]
beforedf1 [600 900]
df2 [650 800]
beforedf1 [950 1050]
I'm trying to use merge_asof()
that I learned from this answer, but it's not working because of the complication added by the superset/subset relationship, e.g.:
# Create "before" condition
df_before = pd.merge_asof(
df2.rename(columns={col:f'before_{col}' for col in df2.columns}).sort_values('before_end'),
df1.assign(before_end=lambda x: x['end']).sort_values('before_end'),
on='before_end',
direction='forward'
).query('end > before_end')
print(df_before)
Output:
before_start before_end start end
0 350 550 600.0 900.0
1 650 800 600.0 900.0
Target output:
before_start before_end start end
0 350 550 600.0 900.0
1 650 800 950.0 1050.0
The problem is that
pd.merge_asof(
df2.rename(columns={col:f'before_{col}' for col in df2.columns}).sort_values('before_end'),
df1.assign(before_end=lambda x: x['end']).sort_values('before_end'),
on='before_end',
direction='forward'
)
finds the closest df1.end
after 800 in df2 [650 800]
, which is df1 [600 900]
:
before_start before_end start end
0 350 550 600.0 900.0
1 650 800 600.0 900.0
2 900 1100 NaN NaN
Is it possible to do a merge_asof()
to find the nearest value based on a certain condition, such as "find nearest df1.end
only if df1.start
in that range is larger than 800 (950 in this case)"? With this level of complexity, maybe there is another function better suited to this task?
Notes:
- Ranges in
df1
can overlap each other, but are never identical. - Ranges in
df2
can overlap each other, but are never identical. df1
anddf2
have over 200k rows each.df1
anddf2
have different number of rows.- Relationships are relative to
df1
, so only one match is needed for each row indf1
, with up to four relationships in each row. Given the data provided above, the final output would look like this after merging back intodf1
:
df1 =
start end subset_start subset_end superset_start superset_end before_start before_end after_start after_end
0 200 300 NaN NaN NaN NaN NaN NaN 350.0 550.0
1 600 900 650.0 800.0 NaN NaN 350.0 550.0 900.0 1100.0
2 950 1050 NaN NaN 900.0 1100.0 650.0 800.0 NaN NaN