23

Let's say I have two dataframes, and the column names for both are:

table 1 columns:
[ShipNumber, TrackNumber, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]

I want to merge the two tables based on both ShipNumber and TrackNumber. However, if i simply use merge in the following way (pseudo code, not real code):

tab1.merge(tab2, "left", on=['ShipNumber','TrackNumber'])

then, that means the values in both ShipNumber and TrackNumber columns from both tables MUST MATCH.

However, in my case, sometimes the ShipNumber column values will match, sometimes the TrackNumber column values will match; as long as one of the two values match for a row, I want the merge to happen.

In other words, if row 1 ShipNumber in tab 1 matches row 3 ShipNumber in tab 2, but the TrackNumber in two tables for the two records do not match, I still want to match the two rows from the two tables.

So basically this is a either/or match condition (pesudo code):

if tab1.ShipNumber == tab2.ShipNumber OR tab1.TrackNumber == tab2.TrackNumber:
    then merge

I hope my question makes sense... Any help is really really appreciated!

As suggested, I looked into this post: Python pandas merge with OR logic But it is not completely the same issue I think, as the OP from that post has a mapping file, and so they can simply do 2 merges to solve this. But I dont have a mapping file, rather, I have two df's with same key columns (ShipNumber, TrackNumber)

Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
alwaysaskingquestions
  • 1,595
  • 5
  • 22
  • 49

3 Answers3

22

Use merge() and concat(). Then drop any duplicate cases where both A and B match (thanks @Scott Boston for that final step).

df1 = pd.DataFrame({'A':[3,2,1,4], 'B':[7,8,9,5]})
df2 = pd.DataFrame({'A':[1,5,6,4], 'B':[4,1,8,5]})

df1         df2
   A  B        A  B
0  3  7     0  1  4
1  2  8     1  5  1
2  1  9     2  6  8
3  4  5     3  4  5

With these data frames we should see:

  • df1.loc[2] matches A on df2.loc[0]
  • df1.loc[1] matches B on df2.loc[2]
  • df1.loc[3] matches both A and B on df2.loc[3]

We'll use suffixes to keep track of what matched where:

suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']

df = pd.concat([df1.merge(df2, on='A', suffixes=suff_A), 
                df1.merge(df2, on='B', suffixes=suff_B)])

     A  A_on_B_match_1  A_on_B_match_2    B  B_on_A_match_1  B_on_A_match_2
0  1.0             NaN             NaN  NaN             9.0             4.0
1  4.0             NaN             NaN  NaN             5.0             5.0
0  NaN             2.0             6.0  8.0             NaN             NaN
1  NaN             4.0             4.0  5.0             NaN             NaN

Note that the second and fourth rows are duplicate matches (for both data frames, A = 4 and B = 5). We need to remove one of those sets.

duplicates = (df.B_on_A_match_1 == df.B_on_A_match_2) # also could remove A_on_B_match
df.loc[~duplicates]

     A  A_on_B_match_1  A_on_B_match_2    B  B_on_A_match_1  B_on_A_match_2
0  1.0             NaN             NaN  NaN             9.0             4.0
0  NaN             2.0             6.0  8.0             NaN             NaN
1  NaN             4.0             4.0  5.0             NaN             NaN
andrew_reece
  • 20,390
  • 3
  • 33
  • 58
3

I would suggest this alternate way for doing merge like this. This seems easier for me.

table1["id_to_be_merged"] = table1.apply(
    lambda row: row["ShipNumber"] if pd.notnull(row["ShipNumber"]) else row["TrackNumber"], axis=1)

You can add the same column in table2 as well if needed and then use in left_in or right_on based on your requirement.

pratpor
  • 1,954
  • 1
  • 27
  • 46
  • 2
    __apply__ and __lambdas__ are not for beginners, and thus, lacking an explanation, makes the answer not as helpful as could be. You could also make your code a bit more readable. Does it beat the method above in performance? – MikeMajara Jun 07 '19 at 12:23
  • 1
    I don't think this solves the problem. I think most of the rows for the question asker include both a valid ship number and a valid track number. Also, the question asker isn't looking to match table 1 ship numbers to table 2 track numbers or vice versa. It has to be `(table_1.ShipNumber == table2.ShipNumber) or (table_1.TrackNumber == table2.TrackNumber)` . Matching `table_1.ShipNumber == table_2.TrackNumber` would be a problem. – JDenman6 Sep 22 '20 at 14:11
0
import pandas as pd

df1 = pd.DataFrame([['1','2','3'],['4','5']])
df2 = pd.DataFrame([['1','2','3'],['4','5']])

df = df1.merge(df2, how='inner', left_on=df1[2].fillna(df1[1]), right_on=df2[2].fillna(df2[1]))

# result
  key_0 0_x 1_x   2_x 0_y 1_y   2_y
0     3   1   2     3   1   2     3
1     5   4   5  None   4   5  None
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
  • 2
    As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community May 25 '23 at 19:58
  • This answer was reviewed in the [Low Quality Queue](https://stackoverflow.com/help/review-low-quality). Here are some guidelines for [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer). Code only answers are **not considered good answers**, and are likely to be downvoted and/or deleted because they are **less useful** to a community of learners. It's only obvious to you. Explain what it does, and how it's different / **better** than existing answers. [From Review](https://stackoverflow.com/review/low-quality-posts/34448694) – Trenton McKinney May 25 '23 at 22:47
  • Hello, please don't just submit code in your answer(s), add some details as to why you think this is the optimal solution. – Destroy666 May 26 '23 at 17:28