How to merge dataframes based on an "OR" condition

Question

Let's say I have two dataframes, and the column names for both are:

table 1 columns:
[ShipNumber, TrackNumber, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]

I want to merge the two tables based on both ShipNumber and TrackNumber. However, if i simply use merge in the following way (pseudo code, not real code):

tab1.merge(tab2, "left", on=['ShipNumber','TrackNumber'])

then, that means the values in both ShipNumber and TrackNumber columns from both tables MUST MATCH.

However, in my case, sometimes the ShipNumber column values will match, sometimes the TrackNumber column values will match; as long as one of the two values match for a row, I want the merge to happen.

In other words, if row 1 ShipNumber in tab 1 matches row 3 ShipNumber in tab 2, but the TrackNumber in two tables for the two records do not match, I still want to match the two rows from the two tables.

So basically this is a either/or match condition (pesudo code):

if tab1.ShipNumber == tab2.ShipNumber OR tab1.TrackNumber == tab2.TrackNumber:
    then merge

I hope my question makes sense... Any help is really really appreciated!

As suggested, I looked into this post: Python pandas merge with OR logic But it is not completely the same issue I think, as the OP from that post has a mapping file, and so they can simply do 2 merges to solve this. But I dont have a mapping file, rather, I have two df's with same key columns (ShipNumber, TrackNumber)

did you see this post: https://stackoverflow.com/questions/43925603/python-pandas-merge-with-or-logic?rq=1 — Vico, Aug 24 '17 at 19:57
Hi @Vico I edited my post to include your suggested post, but I think that post did not answer my question :( — alwaysaskingquestions, Aug 24 '17 at 20:02
Do you want four columns total in your output? I.e. `A_tab1`, `A_tab2`, `B_tab1`, `B_tab2`? It seems like you'll need all of these if only `A` or `B` needs to match on any given row. — andrew_reece, Aug 24 '17 at 20:05
@andrew_reece that is fine if all columns has to be preserved; the key is to merge the tables correctly — alwaysaskingquestions, Aug 24 '17 at 20:06
Not so. They're slightly different. Have a go and see if it works. — cs95, Aug 24 '17 at 20:08
Anyway, more cannot be said without your data or expected output. — cs95, Aug 24 '17 at 20:11

score 22 · Accepted Answer · edited May 13 '23 at 19:39

22

Use merge() and concat(). Then drop any duplicate cases where both A and B match (thanks @Scott Boston for that final step).

df1 = pd.DataFrame({'A':[3,2,1,4], 'B':[7,8,9,5]})
df2 = pd.DataFrame({'A':[1,5,6,4], 'B':[4,1,8,5]})

df1         df2
   A  B        A  B
0  3  7     0  1  4
1  2  8     1  5  1
2  1  9     2  6  8
3  4  5     3  4  5

With these data frames we should see:

df1.loc[2] matches A on df2.loc[0]
df1.loc[1] matches B on df2.loc[2]
df1.loc[3] matches both A and B on df2.loc[3]

We'll use suffixes to keep track of what matched where:

suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']

df = pd.concat([df1.merge(df2, on='A', suffixes=suff_A), 
                df1.merge(df2, on='B', suffixes=suff_B)])

     A  A_on_B_match_1  A_on_B_match_2    B  B_on_A_match_1  B_on_A_match_2
0  1.0             NaN             NaN  NaN             9.0             4.0
1  4.0             NaN             NaN  NaN             5.0             5.0
0  NaN             2.0             6.0  8.0             NaN             NaN
1  NaN             4.0             4.0  5.0             NaN             NaN

Note that the second and fourth rows are duplicate matches (for both data frames, A = 4 and B = 5). We need to remove one of those sets.

duplicates = (df.B_on_A_match_1 == df.B_on_A_match_2) # also could remove A_on_B_match
df.loc[~duplicates]

     A  A_on_B_match_1  A_on_B_match_2    B  B_on_A_match_1  B_on_A_match_2
0  1.0             NaN             NaN  NaN             9.0             4.0
0  NaN             2.0             6.0  8.0             NaN             NaN
1  NaN             4.0             4.0  5.0             NaN             NaN

edited May 13 '23 at 19:39

IonicEcommerce

69
8

answered Aug 24 '17 at 20:16

andrew_reece

20,390
3
33
58

1

I think you need drop_duplicates after concat to make sure your eliminate record that may have successful joins on both A and B. – Scott Boston Aug 24 '17 at 20:19
Good catch! I'll update with the edge case in my sample data. – andrew_reece Aug 24 '17 at 20:22
is the concat same thing as append in this case? – alwaysaskingquestions Aug 24 '17 at 22:29
1

Yes, you could also do `df1.merge(on='A').append(df1.merge(on='B'))`. I think `concat` is often faster. – andrew_reece Aug 24 '17 at 23:03
@andrew_reece can you explain why you only has one argument "on" for the merge in your code df1.merge(on='A').append(df1.merge(on='B'))? what does that mean? – alwaysaskingquestions Aug 25 '17 at 19:48
It's the equivalent of the `if ... then merge` pseudocode in your original post. If you're new to the idea, Pandas docs on [Merge, Join, and Concatenate](https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging) are comprehensive and readable. – andrew_reece Aug 25 '17 at 19:52
How would you set up dups when you have more than 2 columns? IE A, B, C – IonicEcommerce May 11 '23 at 06:20

score 3 · Answer 2 · answered Jul 19 '18 at 11:51

3

I would suggest this alternate way for doing merge like this. This seems easier for me.

table1["id_to_be_merged"] = table1.apply(
    lambda row: row["ShipNumber"] if pd.notnull(row["ShipNumber"]) else row["TrackNumber"], axis=1)

You can add the same column in table2 as well if needed and then use in left_in or right_on based on your requirement.

answered Jul 19 '18 at 11:51

pratpor

1,954
1
27
46

2

__apply__ and __lambdas__ are not for beginners, and thus, lacking an explanation, makes the answer not as helpful as could be. You could also make your code a bit more readable. Does it beat the method above in performance? – MikeMajara Jun 07 '19 at 12:23
1

I don't think this solves the problem. I think most of the rows for the question asker include both a valid ship number and a valid track number. Also, the question asker isn't looking to match table 1 ship numbers to table 2 track numbers or vice versa. It has to be `(table_1.ShipNumber == table2.ShipNumber) or (table_1.TrackNumber == table2.TrackNumber)` . Matching `table_1.ShipNumber == table_2.TrackNumber` would be a problem. – JDenman6 Sep 22 '20 at 14:11

score 0 · Answer 3 · edited May 26 '23 at 01:05

0

import pandas as pd

df1 = pd.DataFrame([['1','2','3'],['4','5']])
df2 = pd.DataFrame([['1','2','3'],['4','5']])

df = df1.merge(df2, how='inner', left_on=df1[2].fillna(df1[1]), right_on=df2[2].fillna(df2[1]))

# result
  key_0 0_x 1_x   2_x 0_y 1_y   2_y
0     3   1   2     3   1   2     3
1     5   4   5  None   4   5  None

edited May 26 '23 at 01:05

Trenton McKinney

56,955
33
144
158

answered May 25 '23 at 15:31

user8620988

1
1

2

As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community May 25 '23 at 19:58
This answer was reviewed in the [Low Quality Queue](https://stackoverflow.com/help/review-low-quality). Here are some guidelines for [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer). Code only answers are **not considered good answers**, and are likely to be downvoted and/or deleted because they are **less useful** to a community of learners. It's only obvious to you. Explain what it does, and how it's different / **better** than existing answers. [From Review](https://stackoverflow.com/review/low-quality-posts/34448694) – Trenton McKinney May 25 '23 at 22:47
Hello, please don't just submit code in your answer(s), add some details as to why you think this is the optimal solution. – Destroy666 May 26 '23 at 17:28

How to merge dataframes based on an "OR" condition

3 Answers3

Linked