Anti-Join Pandas

Question

I have two tables and I would like to append them so that only all the data in table A is retained and data from table B is only added if its key is unique (Key values are unique in table A and B however in some cases a Key will occur in both table A and B).

I think the way to do this will involve some sort of filtering join (anti-join) to get values in table B that do not occur in table A then append the two tables.

I am familiar with R and this is the code I would use to do this in R.

library("dplyr")

## Filtering join to remove values already in "TableA" from "TableB"
FilteredTableB <- anti_join(TableB,TableA, by = "Key")

## Append "FilteredTableB" to "TableA"
CombinedTable <- bind_rows(TableA,FilteredTableB)

How would I achieve this in python?

By key i mean a column contained in both tables with values to merge on. — Ayelavan, Jul 22 '16 at 04:19
I've added an answer. Please let me know what you think via feedback, and if it solves your exercise. Thank you! — Jossie Calderon, Jul 22 '16 at 07:31
I keep being amazed at how intuitive and elegant dplyr makes data processing. — user2105469, Nov 08 '18 at 08:07
Does this answer your question? ["Anti-merge" in pandas (Python)](https://stackoverflow.com/questions/38242368/anti-merge-in-pandas-python) — HarmlessEcon, Aug 24 '21 at 01:47

Dennis Lyubyvy · Answer 1 · 2020-05-02T12:43:57.907

79

indicator = True in merge command will tell you which join was applied by creating new column _merge with three possible values:

left_only
right_only
both

Keep right_only and left_only. That is it.

outer_join = TableA.merge(TableB, how = 'outer', indicator = True)

anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis = 1)

easy!

Here is a comparison with a solution from piRSquared:

1) When run on this example matching based on one column, piRSquared's solution is faster.

2) But it only works for matching on one column. If you want to match on several columns - my solution works just as fine as with one column.

So it's up for you to decide.

edited May 02 '20 at 12:43

answered Apr 05 '19 at 21:45

Dennis Lyubyvy

1,020
7
8

3

in an anti join, should we not keep just `left_only` (and leave out `right_only` too along with `both' ) – piedpiper Mar 26 '22 at 20:00
I think you're correct, @piedpiper. – matmat Jun 01 '22 at 22:02

piRSquared · Accepted Answer · 2016-07-22T07:26:32.517

44

Consider the following dataframes

TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()

TableA

TableB

This is one way to do what you want

Method 1

# Identify what values are in TableB and not in TableA
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)

# Slice TableB accordingly and append to TableA
TableA.append(TableB[where_diff], ignore_index=True)

Method 2

rows = []
for i, row in TableB.iterrows():
    if row.Key not in TableA.Key.values:
        rows.append(row)

pd.concat([TableA.T] + rows, axis=1).T

Timing

4 rows with 2 overlap

Method 1 is much quicker

10,000 rows 5,000 overlap

loops are bad

edited Jul 22 '16 at 07:26

answered Jul 22 '16 at 01:38

piRSquared

285,575
57
475
624

Thanks. Just had a look at the documentation for combine_first and it seems to be kind of what I am looking for. However, how do I specify the column to combine on (key)? For each row in Table B I want it to check the value in the "Key" column and if that value is contained in Table A's "Key" column I want to ignore that row in B. Where there is a key value in table B that is not in Table A I want to append that row from Table B. – Ayelavan Jul 22 '16 at 04:38
@Ayelavan after thinking about it, I cannot recommend `combine_first`. The problem being that if a key exist in A but some of the columns are null, those nulls might be filled in by B's values with same key. the answer to the question you posed in comments is that `combine_first` expects the key to be in the index. – piRSquared Jul 22 '16 at 07:28
Thats exactly what I was looking for. Thank you very much – Ayelavan Jul 22 '16 at 09:55
2

You can simplify Method 1 to: `TableA.append(TableB.loc[~TableB.key.isin(TableA.key)], ignore_index=True)` – gnelson Dec 11 '18 at 17:49
This solution does NOT work when there are multiple key columns (or at least it doesn't work without some extra finagling)! – zozo Dec 17 '22 at 00:15

score 10 · Answer 3 · answered May 26 '17 at 11:48

I had the same problem. This answer using how='outer' and indicator=True of merge inspired me to come up with this solution:

import pandas as pd
import numpy as np

TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()

print('TableA', TableA, sep='\n')
print('TableB', TableB, sep='\n')

TableB_only = pd.merge(
    TableA, TableB,
    how='outer', on='Key', indicator=True, suffixes=('_foo','')).query(
        '_merge == "right_only"')

print('TableB_only', TableB_only, sep='\n')

Table_concatenated = pd.concat((TableA, TableB_only), join='inner')

print('Table_concatenated', Table_concatenated, sep='\n')

Which prints this output:

TableA
  Key         A         B         C
0   a  0.035548  0.344711  0.860918
1   b  0.640194  0.212250  0.277359
2   c  0.592234  0.113492  0.037444
3   d  0.112271  0.205245  0.227157
TableB
  Key         A         B         C
0   a  0.754538  0.692902  0.537704
1   e  0.499092  0.864145  0.004559
2   c  0.082087  0.682573  0.421654
3   f  0.768914  0.281617  0.924693
TableB_only
  Key  A_foo  B_foo  C_foo         A         B         C      _merge
4   e    NaN    NaN    NaN  0.499092  0.864145  0.004559  right_only
5   f    NaN    NaN    NaN  0.768914  0.281617  0.924693  right_only
Table_concatenated
  Key         A         B         C
0   a  0.035548  0.344711  0.860918
1   b  0.640194  0.212250  0.277359
2   c  0.592234  0.113492  0.037444
3   d  0.112271  0.205245  0.227157
4   e  0.499092  0.864145  0.004559
5   f  0.768914  0.281617  0.924693

Jamie Marshall · Answer 4 · 2018-07-31T01:49:57.153

5

Easiest answer imaginable:

tableB = pd.concat([tableB, pd.Series(1)], axis=1)
mergedTable = tableA.merge(tableB, how="left" on="key")

answer = mergedTable[mergedTable.iloc[:,-1].isnull()][tableA.columns.tolist()]

Should be the fastest proposed as well.

edited Jul 31 '18 at 01:49

answered Jul 31 '18 at 01:31

Jamie Marshall

1,885
3
27
50

score 2 · Answer 5 · answered Aug 13 '21 at 15:05

2

One liner

TableA.append(TableB.loc[~TableB.Key.isin(TableA.Key)], ignore_index=True)

%%timeit gives about the same timing as the accepted answer.

answered Aug 13 '21 at 15:05

asiera

492
5
12

score 1 · Answer 6 · answered Jul 22 '16 at 07:29

You'll have both tables TableA and TableB such that both DataFrame objects have columns with unique values in their respective tables, but some columns may have values that occur simultaneously (have the same values for a row) in both tables.

Then, we want to merge the rows in TableA with the rows in TableB that don't match any in TableA for a 'Key' column. The concept is to picture it as comparing two series of variable length, and combining the rows in one series sA with the other sB if sB's values don't match sA's. The following code solves this exercise:

import pandas as pd

TableA = pd.DataFrame([[2, 3, 4], [5, 6, 7], [8, 9, 10]])
TableB = pd.DataFrame([[1, 3, 4], [5, 7, 8], [9, 10, 0]])

removeTheseIndexes = []
keyColumnA = TableA.iloc[:,1] # your 'Key' column here
keyColumnB = TableB.iloc[:,1] # same

for i in range(0, len(keyColumnA)):
    firstValue = keyColumnA[i]
    for j in range(0, len(keyColumnB)):
        copycat = keyColumnB[j]
        if firstValue == copycat:
            removeTheseIndexes.append(j)

TableB.drop(removeTheseIndexes, inplace = True)
TableA = TableA.append(TableB)
TableA = TableA.reset_index(drop=True)

Note this affects TableB's data as well. You can use inplace=False and re-assign it to a newTable, then TableA.append(newTable) alternatively.

# Table A
   0  1   2
0  2  3   4
1  5  6   7
2  8  9  10

# Table B
   0   1  2
0  1   3  4
1  5   7  8
2  9  10  0

# Set 'Key' column = 1
# Run the script after the loop

# Table A
   0   1   2
0  2   3   4
1  5   6   7
2  8   9  10
3  5   7   8
4  9  10   0

# Table B
   0   1  2
1  5   7  8
2  9  10  0

score 1 · Answer 7 · answered Jan 22 '19 at 23:29

Based on one of the other suggestions, here's a function that should do it. Using only pandas functions, no looping. You can use multiple columns as the key as well. If you change the line output = merged.loc[merged.dummy_col.isna(),tableA.columns.tolist()] to output = merged.loc[~merged.dummy_col.isna(),tableA.columns.tolist()] you have a semi_join.

def anti_join(tableA,tableB,on):

   #if joining on index, make it into a column
   if tableB.index.name is not None:
       dummy = tableB.reset_index()[on]
   else:
       dummy = tableB[on]

   #create a dummy columns of 1s
   if isinstance(dummy, pd.Series):
       dummy = dummy.to_frame()

   dummy.loc[:,'dummy_col'] = 1

   #preserve the index of tableA if it has one
   if tableA.index.name is not None:
       idx_name = tableA.index.name
       tableA = tableA.reset_index(drop = False)
   else:
       idx_name = None

   #do a left-join
   merged = tableA.merge(dummy,on=on,how='left')

   #keep only the non-matches
   output = merged.loc[merged.dummy_col.isna(),tableA.columns.tolist()]

   #reset the index (if applicable)
   if idx_name is not None:
       output = output.set_index(idx_name)

   return(output)

Anti-Join Pandas

7 Answers7

Method 1

Method 2

Timing

Linked

Related