
I read this post, and this one.

I was unable to make use of the examples there, perhaps because I have NaN values at either end of my ranges. Here's what my dataframe looks like:

import numpy as np
import pandas as pd

DF1 = pd.DataFrame([{'Conversion Value': 0, 'start': 0.0, 'end': np.nan},
 {'Conversion Value': 1, 'start': 0.0, 'end': 1.7},
 {'Conversion Value': 2, 'start': 1.7, 'end': 2.28},
 {'Conversion Value': 3, 'start': 2.28, 'end': 3.12},
 {'Conversion Value': 4, 'start': 3.12, 'end': 4.84},
 {'Conversion Value': 5, 'start': 4.84, 'end': 5.14},
 {'Conversion Value': 6, 'start': 5.14, 'end': 8.82},
 {'Conversion Value': 7, 'start': 8.82, 'end': 11.16},
 {'Conversion Value': 8, 'start': 11.16, 'end': 12.8},
 {'Conversion Value': 9, 'start': 12.8, 'end': 23.2},
 {'Conversion Value': 10, 'start': 23.2, 'end': 76.78},
 {'Conversion Value': 11, 'start': 76.78, 'end': 123.2},
 {'Conversion Value': 12, 'start': 123.2, 'end': 276.76},
 {'Conversion Value': 13, 'start': 276.76, 'end': 823.24},
 {'Conversion Value': 14, 'start': 823.24, 'end': np.nan}])

And DF2

DF2 = pd.DataFrame({'Value': [0, 2, 13]})

I want to join these, or create a new field on DF2 called 'bucket', so that if Value = 0 I get 0 from DF1, if Value = 2 I get 3 from DF1, and if Value = 13 I get 9 from DF1. And so forth.

I tried to create an IntervalIndex on DF1 like so:

idx = pd.IntervalIndex.from_arrays(DF1['start'], DF1['end'])

but it raises:

ValueError: missing values must be missing in the same location both left and right sides

How can I create an IntervalIndex when the end field contains NaN values?
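
The closest I've got is building the index after dropping the open-ended rows. A sketch (closed='left' is my guess from the mapping above, and it loses the two unbounded buckets entirely):

bounded = DF1.dropna(subset=['end'])
idx = pd.IntervalIndex.from_arrays(bounded['start'], bounded['end'], closed='left')

pos = idx.get_indexer(DF2['Value'])           # -1 where no bounded bucket matches
buckets = bounded['Conversion Value'].to_numpy()[pos].astype(float)
buckets[pos == -1] = np.nan                   # stop -1 wrapping round to the last bucket
DF2['bucket'] = buckets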

Doug Fir
  • No images, please; post the data as code for both DF1 and DF2: https://stackoverflow.com/help/minimal-reproducible-example – Naveed Sep 28 '22 at 20:31
  • Kindly share data not pics – sammywemmy Sep 28 '22 at 20:49
  • Hi @Naveed and sammywemmy, I have added reproducible DFs now. – Doug Fir Sep 28 '22 at 21:08
  • are you mapping the values in DF2 to the conversion-value of DF1? if it is a mapping, then that needs to be defined, the criteria on how to match the two together – Naveed Sep 28 '22 at 22:03
  • @Naveed on DF2 I want a new column 'Conversion Value' that comes from checking the Value field against the `start` and `end` fields in DF1: where `DF2['Value']` falls within a corresponding range (start, end), take the corresponding `Conversion Value` from DF1 – Doug Fir Sep 28 '22 at 22:10
  • @DougFir, got it. posted a solution. does it help? – Naveed Sep 28 '22 at 22:17

3 Answers


You can use pd.merge_asof to perform this type of merge.

# Convert Value to float; merge_asof requires matching key dtypes.
result = pd.merge_asof(DF2.astype(float), DF1, left_on='Value', right_on='start')
print(result)

Output:

   Value  Conversion Value  start    end
0    0.0                 1    0.0   1.70
1    2.0                 2    1.7   2.28
2   13.0                 9   12.8  23.20
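
pd.merge_asof needs both frames sorted on their join keys (they already are here), and with the default direction='backward' each Value matches the last row whose start <= Value, which is why the NaN ends never come into play. To put just the requested column on DF2 afterwards, a minimal sketch ('bucket' is the name the question asks for, not anything merge_asof produces):

# Copy the matched bucket back onto DF2 under the asked-for name
DF2['bucket'] = result['Conversion Value'].to_numpy()
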
BeRT2me

For the between join, pandasql is an option.


# https://pypi.org/project/pandasql/
from pandasql import sqldf

pysqldf = lambda q: sqldf(q, globals())

qry = """
select *
from DF1, DF2
where DF2.Value between DF1.start and DF1.end
"""
result = pysqldf(qry)
result
   Conversion Value  start    end  Value
0                 1    0.0   1.70      0
1                 2    1.7   2.28      2
2                 9   12.8  23.20     13
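
Note that the two open-ended rows drop out automatically: their NaN end becomes NULL in SQLite (which pandasql runs on), and between is never true against NULL. The same join written with explicit inequalities behaves identically; a sketch (quoting "end" sidesteps its status as a SQL keyword):

qry_explicit = """
select DF1.*, DF2.Value
from DF1
join DF2
  on DF2.Value >= DF1.start
 and DF2.Value <= DF1."end"
"""
result = pysqldf(qry_explicit)
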
Naveed

One option is conditional_join from pyjanitor, which is also efficient for range joins:

# pip install pyjanitor
# you can also install the dev version for the latest
# including the ability to use numba for faster performance
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

import janitor
import pandas as pd
(DF1
 .conditional_join(
     DF2.astype(float),
     ('start', 'Value', '<='),
     ('end', 'Value', '>='))
)

   Conversion Value  start    end  Value
0                 1    0.0   1.70    0.0
1                 2    1.7   2.28    2.0
2                 9   12.8  23.20   13.0
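
conditional_join returns every column from both frames; to end up with only the mapping the question asked for, trim afterwards. A sketch ('bucket' is the question's column name, not part of the pyjanitor API):

out = (DF1
       .conditional_join(
           DF2.astype(float),
           ('start', 'Value', '<='),
           ('end', 'Value', '>='))
       .loc[:, ['Value', 'Conversion Value']]
       .rename(columns={'Conversion Value': 'bucket'})
)
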
sammywemmy