How can I efficiently merge these dataframes on range values?

Question

I have two dataframes:

section_headers =
   start_sect_  end_sect_
0            0         50
1          121        139
2          221        270


sentences =
    start_sent_  end_sent_
0             0         50
1            56         76
2            77         85
3            88        111
4           114        120
5           121        139
6           221        270

I'm trying to merge sentences that belongs under each section_header...

A sentence belongs under a section_header when its start_sent_ is greater than or equal to that of a section_header's start_sect_ and less than or equal to the next section_header's start_sect_, etc.

Given this, my desired output is:

merge =
        start_sent_  end_sent_     start_sect_
    0             0         50               0
    1            56         76               0
    2            77         85               0
    3            88        111               0
    4           114        120               0
    5           121        139               121
    6           221        270               221

I initially converted this to a dictionary and then created a new dataframe based on the conditions, but the amount of data I'm dealing with was very large and it took forever to iterate through the records.

I'm trying to devise a way to not have to iterate through these records to do a merge of the data. I tried the broadcast method here Solution 2: Numpy Solution for large dataset, but since this method doesn't allow indexing of the arrays, it doesn't work. Otherwise, it works great for two other merge use cases I have.

Hmmm... it would be nice if whoever marked this down would let me know why. — horcle_buzz, May 25 '23 at 01:17
Probably because the logic is unclear and the provided example seem to have inconsistencies between the input and output. Quick question: are the intervals non overlapping? — mozway, May 25 '23 at 02:08
The logic was clear enough so that @rawson understood. I have corrected the desired output though. — horcle_buzz, May 25 '23 at 19:44

score 1 · Accepted Answer · answered May 25 '23 at 12:33

1

This looks like a use for merge_asof.

Using direction="backward", with section_headers as the right DF, the merge will be on the <= to row:

pd.merge_asof(sentences, section_headers["start_sect_"],
              left_on="start_sent_", right_on="start_sect_",
              direction="backward")

#Out[]: 
#   start_sent_  end_sent_  start_sect_
#0            0         50            0
#1           56         76            0
#2           77         85            0
#3           88        111            0
#4          114        120            0
#5          121        139          121
#6          221        270          221

answered May 25 '23 at 12:33

Rawson

2,637
1
5
14

The output is different to your question's example output, which has a few different numbers that do not appear in the initial DataFrames. – Rawson May 25 '23 at 13:03
Thanks, I corrected the output. This solution worked perfectly (I had come up with a nasty kludge involving a cross join, so this is awesome!). I had tried `merge_asof` unsuccessfully, but was not aware of the `direction` setting. – horcle_buzz May 25 '23 at 19:43

How can I efficiently merge these dataframes on range values?

1 Answers1