
I frequently use pandas for merges (joins) with a range condition.

For instance if there are 2 dataframes:

A (A_id, A_value)

B (B_id, B_low, B_high, B_name)

which are big and approximately of the same size (let's say 2M records each).

I would like to make an inner join between A and B, so that A_value is between B_low and B_high.

Using SQL syntax that would be:

SELECT *
FROM A,B
WHERE A_value between B_low and B_high

and that would be really easy, short and efficient.

Meanwhile, the only way I found in pandas (that doesn't use loops) is to create a dummy column in both tables, join on it (equivalent to a cross join) and then filter out the unneeded rows. That sounds heavy and complex:

A['dummy'] = 1
B['dummy'] = 1
Temp = pd.merge(A,B,on='dummy')
Result = Temp[Temp.A_value.between(Temp.B_low,Temp.B_high)]

Another solution I had is to apply a search function on B for each value of A, using the mask B[(x >= B.B_low) & (x <= B.B_high)], but that sounds inefficient as well and might require index optimization.
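
Roughly, that second idea would look something like this (just a sketch of the approach; per_value_matches is a placeholder name):

# one lookup in B per value of A; returns a Series of (possibly empty) DataFrames
per_value_matches = A.A_value.apply(lambda x: B[(x >= B.B_low) & (x <= B.B_high)])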

Is there a more elegant and/or efficient way to perform this action?

– Dimgold

6 Answers


Setup
Consider the dataframes A and B

import numpy as np
import pandas as pd

A = pd.DataFrame(dict(
        A_id=range(10),
        A_value=range(5, 105, 10)
    ))
B = pd.DataFrame(dict(
        B_id=range(5),
        B_low=[0, 30, 30, 46, 84],
        B_high=[10, 40, 50, 54, 84]
    ))

A

   A_id  A_value
0     0        5
1     1       15
2     2       25
3     3       35
4     4       45
5     5       55
6     6       65
7     7       75
8     8       85
9     9       95

B

   B_high  B_id  B_low
0      10     0      0
1      40     1     30
2      50     2     30
3      54     3     46
4      84     4     84

numpy
The "easiest" way is to use numpy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high.

a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values

i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1)

   A_id  A_value  B_high  B_id  B_low
0     0        5      10     0      0
1     3       35      40     1     30
2     3       35      50     2     30
3     4       45      50     2     30

To address the comments and give something akin to a left join, I appended the part of A that doesn't match.

pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1).append(
    A[~np.in1d(np.arange(len(A)), np.unique(i))],
    ignore_index=True, sort=False
)

    A_id  A_value  B_id  B_low  B_high
0      0        5   0.0    0.0    10.0
1      3       35   1.0   30.0    40.0
2      3       35   2.0   30.0    50.0
3      4       45   2.0   30.0    50.0
4      1       15   NaN    NaN     NaN
5      2       25   NaN    NaN     NaN
6      5       55   NaN    NaN     NaN
7      6       65   NaN    NaN     NaN
8      7       75   NaN    NaN     NaN
9      8       85   NaN    NaN     NaN
10     9       95   NaN    NaN     NaN
– piRSquared
  • amazing solution.. can we say that this is a cross join... If I wanted to keep all rows of `A` only (basically left join on `A`) then what change would I need to make? – joel.wilson Jan 31 '18 at 19:41
  • I wanted to reduce the bursting of rows thats happening too. Any thoughts? – joel.wilson Jan 31 '18 at 19:59
  • This is great, I have the same question as Joel, would it be possible to keep all the values of table A, like a left join? – brandog Jun 26 '18 at 03:38
  • @piRSquared how would you do to keep only rows where A_id == B_id ? We can do it afterwards but I don't think that's the most efficient. In my case, I have an original df of 79k rows, it goes after your operation to 2.3m rows, then when I keep only rows where A_id == B_id, I have 74k rows which is what I expect. Can't this be done all at once ? – yeye Jun 15 '19 at 08:04
  • Found my way: `a = A.A_value.values; aId = A.A_id.values; bId = B.B_id.values; bh = B.B_high.values; bl = B.B_low.values; i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh) & (aId[:, None] == bId))` – yeye Jun 15 '19 at 09:08
  • I'm trying to figure out how to join an aggregate of the values for A joined to B: Example: in the last dataframe shown I would like to get the average of A_value for line 2 and 3 so that only one record remains for B_id=2 unfortunately my pandas/numpy knowledge lacks here – Fenrir Aug 15 '19 at 09:04
  • The more accurate description of a SQL join with ALL of A and matching Bs (a_val BETWEEN b_low AND b_high), like in the second example is a LEFT OUTER JOIN. LEFT JOIN with neither INNER (like the first example) nor OUTER specified, defaults to INNER. – hjonez Oct 04 '19 at 00:50
  • hello @piRSquared, in addition to joining on overlapping values, how can I add an extra condition that each overlapping values must also have the same ID value. Something similar to the SQL equivalent of ```join A B on A.value BETWEEN B.low and B.high AND A.id = B.id``` – lollerskates Dec 06 '19 at 19:24
  • For very big datasets with many rows and columns, I got a memory overflow with this answer. =/ – xicocaio Aug 28 '20 at 19:38
  • @xicocaio I edited the answer to use pandas methods instead of numpy in the generation of the final DataFrame. This works much faster and consumes less memory, so maybe this will solve your overflow issue – Joe Mar 02 '22 at 14:23
  • @joel.wilson, great I will test it. Thanks! – xicocaio Mar 02 '22 at 18:26
  • You should use `.iloc` instead of `.loc`, since the index received from numpy is always referring to `iloc`. `loc` can lead to unintended behavior – Christian Jul 10 '22 at 16:20

Not sure it is more efficient, but you can use SQL directly (via the sqlite3 module, for instance) with pandas (inspired by this question):

import sqlite3

import numpy as np
import pandas as pd

conn = sqlite3.connect(":memory:")
df2 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)
qry = "SELECT * FROM df1, df2 WHERE df1.col1 > 0 and df1.col1<0.5"
tt = pd.read_sql_query(qry,conn)

You can adapt the query as needed in your application.
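
For the range join from the question it might look like this (a sketch, assuming A and B are already loaded as DataFrames and conn is the connection created above):

A.to_sql("A", conn, index=False)
B.to_sql("B", conn, index=False)
result = pd.read_sql_query(
    "SELECT * FROM A JOIN B ON A.A_value BETWEEN B.B_low AND B.B_high",
    conn
)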

– Adonis

I don't know how efficient it is, but someone wrote a wrapper that allows you to use SQL syntax with pandas objects. It's called pandasql. The documentation explicitly states that joins are supported. This might at least be easier to read, since SQL syntax is very readable.
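
A minimal sketch of what that could look like, assuming the A and B frames from the question (the query string is only an illustration of the range condition):

# pip install pandasql
from pandasql import sqldf

query = """
SELECT *
FROM A
JOIN B ON A.A_value BETWEEN B.B_low AND B.B_high
"""
result = sqldf(query, globals())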

– baloo

conditional_join from pyjanitor may be helpful for the abstraction/convenience:

# pip install pyjanitor
import pandas as pd
import janitor

inner join

A.conditional_join(B, 
                   ('A_value', 'B_low', '>='), 
                   ('A_value', 'B_high', '<=')
                  )

   A_id  A_value  B_id  B_low  B_high
0     0        5     0      0      10
1     3       35     1     30      40
2     3       35     2     30      50
3     4       45     2     30      50

left join

A.conditional_join(
       B, 
       ('A_value', 'B_low', '>='), 
       ('A_value', 'B_high', '<='), 
       how = 'left'
    )

    A_id  A_value  B_id  B_low  B_high
0      0        5   0.0    0.0    10.0
1      1       15   NaN    NaN     NaN
2      2       25   NaN    NaN     NaN
3      3       35   1.0   30.0    40.0
4      3       35   2.0   30.0    50.0
5      4       45   2.0   30.0    50.0
6      5       55   NaN    NaN     NaN
7      6       65   NaN    NaN     NaN
8      7       75   NaN    NaN     NaN
9      8       85   NaN    NaN     NaN
10     9       95   NaN    NaN     NaN
– sammywemmy
  • I get the following error: `Cannot interpret ' [ 5, 15, 25, 35, 35, 35, 45, 45, 45, 55, 55, 55, 55, 65, 65, 65, 65, 75, 75, 75, 75] Length: 21, dtype: int64' as a data type` Any ideas? – Quixotic22 Nov 02 '21 at 14:56
  • Hi @Quixotic22, a new version of `pyjanitor` was recently released; you should not have that error anymore. You can leave an issue on the GitHub page if you encounter any issues. – sammywemmy Dec 02 '21 at 03:27

Let's take a simple example:

df=pd.DataFrame([2,3,4,5,6],columns=['A'])

returns

    A
0   2
1   3
2   4
3   5
4   6

Now let's define a second dataframe:

df2=pd.DataFrame([1,6,2,3,5],columns=['B_low'])
df2['B_high']=[2,8,4,6,6]

results in

    B_low   B_high
0   1       2
1   6       8
2   2       4
3   3       6
4   5       6

Here we go; we want the output to be index 3 with an A value of 5:

df.where(df['A']>=df2['B_low']).where(df['A']<df2['B_high']).dropna()

results in

    A
3   5.0
– suvy

I know this is an old question, but for newcomers there is now the pandas.merge_asof function, which performs a join based on the closest match.

In case you want to do a merge so that a column of one DataFrame (df_right) is between two columns of another DataFrame (df_left), you can do the following:

import pandas as pd

df_left = pd.DataFrame({
    "time_from": [1, 4, 10, 21],
    "time_to": [3, 7, 15, 27]
})

df_right = pd.DataFrame({
    "time": [2, 6, 16, 25]
})

df_left

  time_from time_to
0        1       3
1        4       7
2       10      15
3       21      27


df_right

  time
0    2
1    6
2   16
3   25

First, find the matches of the right DataFrame that are closest to, but not smaller than, the left boundary (time_from) of the left DataFrame:

merged = pd.merge_asof(
    left=df_left,
    right=df_right.rename(columns={"time": "candidate_match_1"}),
    left_on="time_from",
    right_on="candidate_match_1",
    direction="forward"
)

merged

   time_from  time_to  candidate_match_1
0          1        3                  2
1          4        7                  6
2         10       15                 16
3         21       27                 25

As you can see, the candidate match at index 2 is wrong, as 16 is not between 10 and 15.

Then, find the matches of the right DataFrame that are closest to, but not greater than, the right boundary (time_to) of the left DataFrame:

merged = pd.merge_asof(
    left=merged,
    right=df_right.rename(columns={"time": "candidate_match_2"}),
    left_on="time_to",
    right_on="candidate_match_2",
    direction="backward"
)

merged

   time_from  time_to  candidate_match_1  candidate_match_2
0          1        3                  2                  2
1          4        7                  6                  6
2         10       15                 16                  6
3         21       27                 25                 25

Finally, keep the matches where the two candidates are the same, meaning that the value of the right DataFrame is between the values of the two columns of the left DataFrame:

merged["match"] = None
merged.loc[merged["candidate_match_1"] == merged["candidate_match_2"], "match"] = \
    merged.loc[merged["candidate_match_1"] == merged["candidate_match_2"], "candidate_match_1"]

merged

   time_from  time_to  candidate_match_1  candidate_match_2 match
0          1        3                  2                  2     2
1          4        7                  6                  6     6
2         10       15                 16                  6  None
3         21       27                 25                 25    25
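
If you only want the rows that actually matched (the inner-join result), you can then drop the unmatched candidates, for example:

result = merged.dropna(subset=["match"])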