0

I have two dataframe columns that looks something like this:

   col1    col2 
0   A2      B8
1   B8      C3
2   D2      A2 

I want to identify all values in col1 which are not in col2 and visa versa. In the above example, the output would be:

C3, D2 

Both lists are not unique.

Cahn Ex
  • 61
  • 1
  • 4
  • 11
  • 1
    Possible duplicate of [Compare two columns using pandas](https://stackoverflow.com/questions/27474921/compare-two-columns-using-pandas) – Avión Jul 19 '17 at 14:43

2 Answers2

1

Why not convert each column to a set and calculate the symmetric difference?

import pandas as pd

df = pd.DataFrame({'col1': ['A2', 'B8', 'D2'],
                   'col2': ['B8', 'C3', 'A2']})

print set(df['col1']).symmetric_difference(set(df['col2']))

This prints:

set(['C3', 'D2'])

EDIT:

If you want to track which elements came from where, you could adapt this and create a new dictionary, like so:

col1 = set(df['col1'])
col2 = set(df['col2'])

exclusive_items = {'col1': [x for x in col1 if x not in col2],
                   'col2': [x for x in col2 if x not in col1]}

print exclusive_items

Where each key in exclusive_items contains as its values the entries unique to that column. This prints:

{'col2': ['C3'], 'col1': ['D2']}

In fact, as we have sets we could simplify this to:

exclusive_items = {'col1': col1-col2,
                   'col2': col2-col1}
asongtoruin
  • 9,794
  • 3
  • 36
  • 47
  • With this method, is there a way of knowing which was in col1 and not in col2 and which was in col2 and not in col1? – Cahn Ex Jul 19 '17 at 14:48
  • @CahnEx not strictly speaking, but I've added a modified version to demonstrate how this could be possible – asongtoruin Jul 19 '17 at 14:55
0
import pandas as pd

df = pd.DataFrame({'col1': ['A2', 'B8', 'D2'],
                   'col2': ['B8', 'C3', 'A2']})


in_col1_not_in_col2 = list(set(df['col1'])-(set(df['col2'])))
in_col2_not_in_col1 = list(set(df['col2'])-(set(df['col1'])))

print('in_col1_not_in_col2: ')
print(in_col1_not_in_col2)
print('in_col2_not_in_col1: ')
print(in_col2_not_in_col1)

in_col1_not_in_col2: ['D2']

in_col2_not_in_col1: ['C3']

paolof89
  • 1,319
  • 5
  • 17
  • 31