1

Given the following two dataframes that represent ranges:

df1 =

  start   end
0   200   300
1   600   900
2   950  1050

df2 =

  start   end
0   350   550
1   650   800
2   900  1100

They can be represented as such:

df1  [200 300]            [600    900] [950 1050]
df2            [350  550]   [650 800] [900   1100]

I'm tasked with identifying four different types of relationships between df1 and df2 ranges:

  1. df2 subset of df1
    • df2 [650 800] subset of df1 [600 900]
  2. df2 superset of df1
    • df2 [900 1100] superset of df1 [950 1050]
  3. df2 after df1 (nearest neighbor, excluding subset/superset)
    • df2 [350 550] after df1 [200 300]
    • df2 [900 1100] after df1 [600 900]
  4. df2 before df1 (nearest neighbor, excluding subset/superset)
    • df2 [350 550] before df1 [600 900]
    • df2 [650 800] before df1 [950 1050]

I'm trying to use merge_asof() that I learned from this answer, but it's not working because of the complication added by the superset/subset relationship, e.g.:

# Create "before" condition
df_before = pd.merge_asof(
    df2.rename(columns={col:f'before_{col}' for col in df2.columns}).sort_values('before_end'),
    df1.assign(before_end=lambda x: x['end']).sort_values('before_end'),
    on='before_end',
    direction='forward'
).query('end > before_end')

print(df_before)

Output:

  before_start  before_end  start    end
0          350         550  600.0  900.0
1          650         800  600.0  900.0

Target output:

  before_start  before_end  start     end
0          350         550  600.0   900.0
1          650         800  950.0  1050.0

The problem is that

pd.merge_asof(
    df2.rename(columns={col:f'before_{col}' for col in df2.columns}).sort_values('before_end'),
    df1.assign(before_end=lambda x: x['end']).sort_values('before_end'),
    on='before_end',
    direction='forward'
)

finds the closest df1.end after 800 in df2 [650 800], which is df1 [600 900]:

  before_start  before_end  start    end
0          350         550  600.0  900.0
1          650         800  600.0  900.0
2          900        1100    NaN    NaN

Is it possible to do a merge_asof() to find the nearest value based on a certain condition, such as "find nearest df1.end only if df1.start in that range is larger than 800 (950 in this case)"? With this level of complexity, maybe there is another function better suited to this task?

Notes:

  • Ranges in df1 can overlap each other, but are never identical.
  • Ranges in df2 can overlap each other, but are never identical.
  • df1 and df2 have over 200k rows each.
  • df1 and df2 have different number of rows.
  • Relationships are relative to df1, so only one match is needed for each row in df1, with up to four relationships in each row. Given the data provided above, the final output would look like this after merging back into df1:

df1 =

  start   end  subset_start  subset_end  superset_start  superset_end  before_start  before_end  after_start  after_end
0   200   300           NaN         NaN             NaN           NaN           NaN         NaN        350.0      550.0
1   600   900         650.0       800.0             NaN           NaN         350.0       550.0        900.0     1100.0
2   950  1050           NaN         NaN           900.0        1100.0         650.0       800.0          NaN        NaN
thdoan
  • 18,421
  • 1
  • 62
  • 57
  • Is it the case that each of `df1` and `df2` will have no overlapping ranges within them? This looks like the case, with the ranges non-overlapping within each, but it affects the solution e.g. if `df1` can have many overlapping ranges like `[0, 10)`, `[0, 20)`, etc. Also, how do you want to handle multiple matches? The same range in `df2` could be a nearest neighbor before one row in `df1` but also the nearest after another row. The reverse is also true. Do you only need one match per row in `df1`? Or per row in `df2`? – mcskinner Apr 13 '20 at 02:03
  • @mcskinner thanks for the feedback. I will update the question to clarify. – thdoan Apr 13 '20 at 04:07

1 Answers1

1

It is possible to use pd.merge_asof to find the before and after options.

before_df = pd.merge_asof(df1, df2, left_on='start', right_on='end', suffixes=['', '_before'])
before_df
#    start   end  start_before  end_before
# 0    200   300           NaN         NaN
# 1    600   900         350.0       550.0
# 2    950  1050         650.0       800.0

after_df = pd.merge_asof(df2, df1, left_on='start', right_on='end', suffixes=['_after', ''])
#    start_after  end_after  start  end
# 0          350        550    200  300
# 1          650        800    200  300
# 2          900       1100    600  900

But it is not easy to make it work or the subset and superset computations. For those, I would reach for this algorithm that can work in one pass.

def range_intersect(lh_ranges, rh_ranges): 
    all_ranges = sorted(
        [(b, e, 'lh') for b, e in lh_ranges] +
        [(b, e, 'rh') for b, e in rh_ranges]
    ) 

    res = [] 
    max_b, max_e = None, None 
    for b, e, which in all_ranges: 
        if which == 'rh': 
            if max_e is None or e > max_e: 
                max_b, max_e = b, e 
        elif max_e is not None and e <= max_e: 
            res.append((b, e, max_b, max_e)) 

    return res

This finds the elements of lh that are subsets of elements in rh. To find supersets, it can be run in reverse. For simplicity it takes a list of ranges instead of DataFrames. The conversion is straightforward.

lh = df1.to_dict('split')['data']
rh = df2.to_dict('split')['data']

lh
# [[200, 300], [600, 900], [950, 1050]]

rh                                                                                                                                                                                                                                  
# [[350, 550], [650, 800], [900, 1100]]

After that, the result DataFrame you want is just a few merges away.

# Compute supersets, then run in reverse to get the subsets.
superset_df = pd.DataFrame(range_intersect(lh, rh), columns=['start', 'end', 'start_superset', 'end_superset'])
subset_df = pd.DataFrame(range_intersect(rh, lh), columns=['start_subset', 'end_subset', 'start', 'end'])

# Merge all the results together.
result = df1.merge(subset_df, how='left').merge(superset_df, how='left').merge(before_df, how='left').merge(after_df, how='left')

# The reversed operations, after and subset, can have many matches in df1.
result.drop_duplicates(['start', 'end'])
#    start   end  start_subset  end_subset  start_superset  end_superset  start_before  end_before  start_after  end_after
# 0    200   300           NaN         NaN             NaN           NaN           NaN         NaN        350.0      550.0
# 2    600   900         650.0       800.0             NaN           NaN         350.0       550.0        900.0     1100.0
# 3    950  1050           NaN         NaN           900.0        1100.0         650.0       800.0          NaN        NaN
mcskinner
  • 2,620
  • 1
  • 11
  • 21
  • Thanks so much for this. I am still wrapping my head around your solution, and once I understand exactly what it's doing I will accept your answer :). – thdoan Apr 18 '20 at 19:45
  • No problem. Let me know if there is anything I can do to help :) – mcskinner Apr 18 '20 at 20:03
  • 1
    The key insight for `range_intersect` is that, if you sort by range starts, you can ignore the end of every candidate superset range, `rh`, except the largest you have seen. Because the starts are sorted, you've processed the superset range, and are now processing a subset candidate from `lh`, you're guaranteed that the subset does not start too early. So just check if the candidate subset also ends before the at least one of the superset ranges, which is the same as checking just the largest endpoint. So you just need to track that largest endpoint, and see if the subset ends first. – mcskinner Apr 18 '20 at 20:10
  • Yes, I can see that after breaking down your function and adding `print()` -- very clever! After adding `print()` everywhere, I can make sense of the logic now. The key takeaway for me also was that I was not using `merge_asof()` to its full potential. – thdoan Apr 18 '20 at 21:34
  • mcskinner, when I plugged in the real data, the solution didn't work because both `df1` and `df2` have other columns (some shared between them, some not), so `range_intersect()` returned a "too many values to unpack" error. In terms of efficiency when processing over 200k rows, do you recommend modifying `range_intersect()` to take into account additional columns (lots of them), or save the dataframe into a separate dataframe with just `start` and `end` columns and merge them back at the end? – thdoan Apr 19 '20 at 19:46
  • 1
    I like the second approach. You can unpack like this `lh = df1[['start', 'end']].to_dict('split')['data']` and similarly for `rh`. You shouldn't need to change anything else, since the final merge on `df1` should add all the columns you excluded for `range_intersect`. – mcskinner Apr 19 '20 at 19:49
  • 1
    Thanks. I think second approach is less headache the more I dig into this, especially when there are like 10 additional info columns in `df2` that are not used in calculations, but should be displayed in final output (e.g., code_subset, category_subset). UPDATE: I just noticed `pd.merge_asof()` only adds suffixes to the keys (`start` and `end`), not the additional columns. I think I will try saving a copy of dataframe with just `start` and `end` columns, then merge them back at the end with `df2.rename(columns={col:f'{col}_subset' for col in df2.columns})`, etc. – thdoan Apr 19 '20 at 20:01
  • 1
    I was able to rename all non-key columns directly in `pd.merge_asof()` by, in "before" condition, replacing `df2` with `df2.rename(columns=lambda x: x+'_before' if x not in ['start', 'end'] else x)`. I also found someone doing comprehension directly in `rename()` (https://stackoverflow.com/a/58143182/452587), so I'll have to time the different methods to see which is more efficient over lots of rows. – thdoan Apr 19 '20 at 20:46
  • There are at least five different ways to rename specific columns in pandas ^^. I have compiled them at https://stackoverflow.com/a/61312706/452587 – thdoan Apr 19 '20 at 22:56
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/212042/discussion-between-thdoan-and-mcskinner). – thdoan Apr 20 '20 at 02:23