Opposite of left outer join in pandas based on one column results in less records than expected

Question

I have the following two dataframes:

df2 has 2166 unique records.

I am trying to return the data in the blue area so I am using the following technique:

To get the green data I simply perform an inner join:
df_common = df1.merge(df2, how='inner' on='SKU')
which yields 413 unique records

To get the blue area, I appended the green data to df2:
df2_plus_green = df2.append(df_common, ignore_index = True)
which yields 2579 records (which makes sense (413 + 2166 = 2579)

Then I dropped the duplicates:
df_blue = df2_plus_green.drop_duplicates(keep=False, subset=['SKU'])
which yields 1666 records. I would have thought it would yield 1753 records (2166 - 413 = 1753)

I have tried the above technique on a much smaller set of data and it works as expected. I assume the extra 87 records that are being dropped are duplicates somehow but I have checked each dataframe at each step along the way and all dataframes are made up of completely unique records. Can anyone point me in the right direction? I am sure it is something obvious that I'm missing.

Please explain how to read the diagram. What are the columns of the dfs? PS A [mre] is obligatory in debugging questions. — philipxy, Jan 08 '22 at 04:08
Sorry for the lack of clarity. Both df's are comprised of the same columns but I am only merging and dropping duplicates based on one column 'SKU' so the other columns are irrelevant. DF1 has 1519 unique records and DF2 has 2166 unique records. The Venn diagram is meant to explain that there are common records between the fields, similar to how W3Schools explains how inner joins work in the venn diagrams halfway down the page here: https://www.w3schools.com/sql/sql_join.asp — user2382321, Jan 08 '22 at 04:32
Please clarify via edits, not comments. PS Even if the dfs have the same columns, it matters what you are joining on & what the keys are & whether there are duplicates & nulls. You really need to give details of exactly what circumstances you have & assume including constraints. PS Those Venn diagrams you see people parroting everywhere are useless & misleading for explaining outer vs inner join & moreso for explaining inner join. See [Venn Diagram for Natural Join](https://stackoverflow.com/a/55642928/3404097) & comments of mine at its linked question pages. — philipxy, Jan 08 '22 at 05:04

score 0 · Accepted Answer · edited Jan 08 '22 at 04:30

0

While you're merging both the tables, use the Parameter "Indicator" which will tell us how each and every data point is joined.

df_right_only = pd.merge(df1, df2, on = "Common Column", how = "right", indicator = True)

Then we can use LOC to filter out only the values that are pulled uniquely from the right table.

df_right_only.loc[df_right_only ['_merge'] == "right_only", 'Column that you want']

This approach is known as Anti-Joins.

edited Jan 08 '22 at 04:30

Mayank Porwal

33,470
8
37
58

answered Jan 08 '22 at 04:10

Gowtham Jayachandiran

118
8

Definitely. I just did. Can you please explain why you're answer answers my question? Why is your method of anti join result in a different number than mine (i.e 1753 vs 1666)? – user2382321 Jan 08 '22 at 06:34

Opposite of left outer join in pandas based on one column results in less records than expected

1 Answers1