
I frequently use pandas for merges (joins) with a range condition.

For instance if there are 2 dataframes:

A (A_id, A_value)

B (B_id, B_low, B_high, B_name)

which are big and approximately of the same size (let's say 2M records each).

I would like to make an inner join between A and B, so that A_value is between B_low and B_high.

Using SQL syntax that would be:

SELECT *
FROM A,B
WHERE A_value between B_low and B_high

and that would be really easy, short and efficient.

Meanwhile, the only way I found in pandas (that doesn't use loops) is to create a dummy column in both tables, join on it (equivalent to a cross join) and then filter out the unneeded rows. That sounds heavy and complex:

A['dummy'] = 1
B['dummy'] = 1
Temp = pd.merge(A,B,on='dummy')
Result = Temp[Temp.A_value.between(Temp.B_low,Temp.B_high)]

Another solution I had is to apply a search function on B for each value of A, using the mask B[(x >= B.B_low) & (x <= B.B_high)], but that sounds inefficient as well and might require index optimization.
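
Roughly, that second idea would look something like this (just a sketch of the approach; per_value_matches is a placeholder name):

# one lookup in B per value of A; returns a Series of (possibly empty) DataFrames
per_value_matches = A.A_value.apply(lambda x: B[(x >= B.B_low) & (x <= B.B_high)])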

Is there a more elegant and/or efficient way to perform this action?

– Dimgold

6 Answers


Setup
Consider the dataframes A and B

import numpy as np
import pandas as pd

A = pd.DataFrame(dict(
        A_id=range(10),
        A_value=range(5, 105, 10)
    ))
B = pd.DataFrame(dict(
        B_id=range(5),
        B_low=[0, 30, 30, 46, 84],
        B_high=[10, 40, 50, 54, 84]
    ))

A

   A_id  A_value
0     0        5
1     1       15
2     2       25
3     3       35
4     4       45
5     5       55
6     6       65
7     7       75
8     8       85
9     9       95

B

   B_high  B_id  B_low
0      10     0      0
1      40     1     30
2      50     2     30
3      54     3     46
4      84     4     84

numpy
The "easiest" way is to use numpy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high.

a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values

i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1)

   A_id  A_value  B_high  B_id  B_low
0     0        5      10     0      0
1     3       35      40     1     30
2     3       35      50     2     30
3     4       45      50     2     30

To address the comments and give something akin to a left join, I appended the part of A that doesn't match.

pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1).append(
    A[~np.in1d(np.arange(len(A)), np.unique(i))],
    ignore_index=True, sort=False
)

    A_id  A_value  B_id  B_low  B_high
0      0        5   0.0    0.0    10.0
1      3       35   1.0   30.0    40.0
2      3       35   2.0   30.0    50.0
3      4       45   2.0   30.0    50.0
4      1       15   NaN    NaN     NaN
5      2       25   NaN    NaN     NaN
6      5       55   NaN    NaN     NaN
7      6       65   NaN    NaN     NaN
8      7       75   NaN    NaN     NaN
9      8       85   NaN    NaN     NaN
10     9       95   NaN    NaN     NaN
– piRSquared
  • amazing solution.. can we say that this is a cross join... If I wanted to keep all rows of `A` only (basically left join on `A`) then what change would I need to make? – joel.wilson Jan 31 '18 at 19:41
  • I wanted to reduce the bursting of rows thats happening too. Any thoughts? – joel.wilson Jan 31 '18 at 19:59
  • This is great, I have the same question as Joel, would it be possible to keep all the values of table A, like a left join? – brandog Jun 26 '18 at 03:38
  • @piRSquared how would you do to keep only rows where A_id == B_id ? We can do it afterwards but I don't think that's the most efficient. In my case, I have an original df of 79k rows, it goes after your operation to 2.3m rows, then when I keep only rows where A_id == B_id, I have 74k rows which is what I expect. Can't this be done all at once ? – yeye Jun 15 '19 at 08:04
  • Found my way: `a = A.A_value.values; aId = A.A_id.values; bId = B.B_id.values; bh = B.B_high.values; bl = B.B_low.values; i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh) & (aId[:, None] == bId))` – yeye Jun 15 '19 at 09:08
  • I'm trying to figure out how to join an aggregate of the values for A joined to B: Example: in the last dataframe shown I would like to get the average of A_value for line 2 and 3 so that only one record remains for B_id=2 unfortunately my pandas/numpy knowledge lacks here – Fenrir Aug 15 '19 at 09:04
  • The more accurate description of a SQL join with ALL of A and matching Bs (a_val BETWEEN b_low AND b_high), like in the second example is a LEFT OUTER JOIN. LEFT JOIN with neither INNER (like the first example) nor OUTER specified, defaults to INNER. – hjonez Oct 04 '19 at 00:50
  • hello @piRSquared, in addition to joining on overlapping values, how can I add an extra condition that each overlapping values must also have the same ID value. Something similar to the SQL equivalent of ```join A B on A.value BETWEEN B.low and B.high AND A.id = B.id``` – lollerskates Dec 06 '19 at 19:24
  • For very big datasets with many rows and columns, I got a memory overflow with this answer. =/ – xicocaio Aug 28 '20 at 19:38
  • @xicocaio I edited the answer to use pandas methods instead of numpy in the generation of the final DataFrame. This works much faster and consumes less memory, so maybe this will solve your overflow issue – Joe Mar 02 '22 at 14:23
  • @joel.wilson, great I will test it. Thanks! – xicocaio Mar 02 '22 at 18:26
  • You should use `.iloc` instead of `.loc`, since the index received from numpy is always referring to `iloc`. `loc` can lead to unintended behavior – Christian Jul 10 '22 at 16:20

Not sure it is more efficient, but you can use SQL directly (via the sqlite3 module, for instance) with pandas (inspired by this question):

import sqlite3

import numpy as np
import pandas as pd

conn = sqlite3.connect(":memory:")
df2 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)
qry = "SELECT * FROM df1, df2 WHERE df1.col1 > 0 and df1.col1<0.5"
tt = pd.read_sql_query(qry,conn)

You can adapt the query as needed in your application.
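
For the range join from the question it might look like this (a sketch, assuming A and B are already loaded as DataFrames and conn is the connection created above):

A.to_sql("A", conn, index=False)
B.to_sql("B", conn, index=False)
result = pd.read_sql_query(
    "SELECT * FROM A JOIN B ON A.A_value BETWEEN B.B_low AND B.B_high",
    conn
)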

– Adonis

I don't know how efficient it is, but someone wrote a wrapper that allows you to use SQL syntax with pandas objects. It's called pandasql. The documentation explicitly states that joins are supported. This might at least be easier to read, since SQL syntax is very readable.
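
A minimal sketch of what that could look like, assuming the A and B frames from the question (the query string is only an illustration of the range condition):

# pip install pandasql
from pandasql import sqldf

query = """
SELECT *
FROM A
JOIN B ON A.A_value BETWEEN B.B_low AND B.B_high
"""
result = sqldf(query, globals())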

– baloo

conditional_join from pyjanitor may be helpful for the abstraction/convenience:

# pip install pyjanitor
import pandas as pd
import janitor

inner join

A.conditional_join(B, 
                   ('A_value', 'B_low', '>='), 
                   ('A_value', 'B_high', '<=')
                  )

   A_id  A_value  B_id  B_low  B_high
0     0        5     0      0      10
1     3       35     1     30      40
2     3       35     2     30      50
3     4       45     2     30      50

left join

A.conditional_join(
       B, 
       ('A_value', 'B_low', '>='), 
       ('A_value', 'B_high', '<='), 
       how = 'left'
    )

    A_id  A_value  B_id  B_low  B_high
0      0        5   0.0    0.0    10.0
1      1       15   NaN    NaN     NaN
2      2       25   NaN    NaN     NaN
3      3       35   1.0   30.0    40.0
4      3       35   2.0   30.0    50.0
5      4       45   2.0   30.0    50.0
6      5       55   NaN    NaN     NaN
7      6       65   NaN    NaN     NaN
8      7       75   NaN    NaN     NaN
9      8       85   NaN    NaN     NaN
10     9       95   NaN    NaN     NaN
– sammywemmy
  • I get the following error: `Cannot interpret ' [ 5, 15, 25, 35, 35, 35, 45, 45, 45, 55, 55, 55, 55, 65, 65, 65, 65, 75, 75, 75, 75] Length: 21, dtype: int64' as a data type` Any ideas? – Quixotic22 Nov 02 '21 at 14:56
  • Hi @Quixotic22, a new version of `pyjanitor` was recently released; you should not have that error anymore. You can leave an issue on the GitHub page if you encounter any issues. – sammywemmy Dec 02 '21 at 03:27

Let's take a simple example:

df=pd.DataFrame([2,3,4,5,6],columns=['A'])

returns

    A
0   2
1   3
2   4
3   5
4   6

Now let's define a second dataframe:

df2=pd.DataFrame([1,6,2,3,5],columns=['B_low'])
df2['B_high']=[2,8,4,6,6]

results in

    B_low   B_high
0   1       2
1   6       8
2   2       4
3   3       6
4   5       6

Here we go; we want the output to be index 3 with an A value of 5:

df.where(df['A']>=df2['B_low']).where(df['A']<df2['B_high']).dropna()

results in

    A
3   5.0
– suvy

I know this is an old question, but for newcomers there is now the pandas.merge_asof function, which performs a join based on the closest match.

In case you want to do a merge so that a column of one DataFrame (df_right) is between two columns of another DataFrame (df_left), you can do the following:

import pandas as pd

df_left = pd.DataFrame({
    "time_from": [1, 4, 10, 21],
    "time_to": [3, 7, 15, 27]
})

df_right = pd.DataFrame({
    "time": [2, 6, 16, 25]
})

df_left

  time_from time_to
0        1       3
1        4       7
2       10      15
3       21      27


df_right

  time
0    2
1    6
2   16
3   25

First, find the matches of the right DataFrame that are closest to, but not smaller than, the left boundary (time_from) of the left DataFrame:

merged = pd.merge_asof(
    left=df_left,
    right=df_right.rename(columns={"time": "candidate_match_1"}),
    left_on="time_from",
    right_on="candidate_match_1",
    direction="forward"
)

merged

   time_from  time_to  candidate_match_1
0          1        3                  2
1          4        7                  6
2         10       15                 16
3         21       27                 25

As you can see, the candidate match at index 2 is wrong, as 16 is not between 10 and 15.

Then, find the matches of the right DataFrame that are closest to, but not greater than, the right boundary (time_to) of the left DataFrame:

merged = pd.merge_asof(
    left=merged,
    right=df_right.rename(columns={"time": "candidate_match_2"}),
    left_on="time_to",
    right_on="candidate_match_2",
    direction="backward"
)

merged

   time_from  time_to  candidate_match_1  candidate_match_2
0          1        3                  2                  2
1          4        7                  6                  6
2         10       15                 16                  6
3         21       27                 25                 25

Finally, keep the matches where the two candidates are the same, meaning that the value of the right DataFrame is between the values of the two columns of the left DataFrame:

merged["match"] = None
merged.loc[merged["candidate_match_1"] == merged["candidate_match_2"], "match"] = \
    merged.loc[merged["candidate_match_1"] == merged["candidate_match_2"], "candidate_match_1"]

merged

   time_from  time_to  candidate_match_1  candidate_match_2 match
0          1        3                  2                  2     2
1          4        7                  6                  6     6
2         10       15                 16                  6  None
3         21       27                 25                 25    25
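
If you only want the rows that actually matched (the inner-join result), you can then drop the unmatched candidates, for example:

result = merged.dropna(subset=["match"])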