How to use pandas to print the difference of two columns?

Question

I have two data sets.

The first has a column with a list of email addresses:

DF1

Email
xxxx@abc.gov
xxxx@abc.gov
xxxx@abc.gov
xxxx@abc.gov
xxxx@abc.gov

The second CSV, Dataframe2:

Email
xxxx@abc.gov
xxxx@abc.gov
xxxx@abc.gov
xxxx@abc.gov
dddd@abc.com
dddd@abc.com
3333@abc.com

import pandas as pd

SansList = r'C:\Sans compare\SansList.csv'
AllUsers = r'C:\Sans compare\AllUser.csv'

## print Name column only and turn into data sets from CSV ##
df1 = pd.read_csv(SansList, usecols=[0])

df2 = pd.read_csv(AllUsers, usecols=[2])

print(df1['Email'].isin(df2)==False)

I want the result to be:

Dataframe3
dddd@abc.com
dddd@abc.com
3333@abc.com

Not quite sure how to fix my dataset... :(

I do not think you want to use pandas. Use sets. set(df1['Email'].values) and then set.intersection() — Keith, May 05 '17 at 22:43
http://stackoverflow.com/questions/14057007/remove-rows-not-isinx — davidjbeiler, May 05 '17 at 22:47

score 1 · Accepted Answer · answered May 05 '17 at 22:44

Option 1
isin

df2[~df2.Email.isin(df1.Email)]

          Email
4  dddd@abc.com
5  dddd@abc.com
6  3333@abc.com
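
For anyone who wants to reproduce Option 1 without the CSV files, here is a minimal sketch. The two frames are built by hand from the sample data in the question (an assumption; the real files may contain other columns), and the names mask and only_in_df2 are purely illustrative.

```python
import pandas as pd

# Hand-built stand-ins for the two CSV files, using the sample rows from the question.
df1 = pd.DataFrame({"Email": ["xxxx@abc.gov"] * 5})
df2 = pd.DataFrame(
    {"Email": ["xxxx@abc.gov"] * 4 + ["dddd@abc.com", "dddd@abc.com", "3333@abc.com"]}
)

# True where a df2 email also appears somewhere in df1.
mask = df2["Email"].isin(df1["Email"])

# Invert the mask to keep only the emails that df1 does not contain.
only_in_df2 = df2[~mask]
print(only_in_df2)
```

Note that the comparison is between two columns (Series). Passing the whole DataFrame to isin, as in the question, does not compare the email values.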

Option 2
query

df2.query('Email not in @df1.Email')

          Email
4  dddd@abc.com
5  dddd@abc.com
6  3333@abc.com

Option 3
merge

pd.DataFrame.merge with indicator=True lets you see which dataframe each row came from. We can then filter on it.

df2.merge(
    df1, how='outer', indicator=True
).query('_merge == "left_only"').drop(columns='_merge')

           Email
20  dddd@abc.com
21  dddd@abc.com
22  3333@abc.com
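
The index jumps to 20 in the output above because every matching df2 row is paired with every matching df1 row (4 x 5 = 20 pairs) before the unmatched rows appear. A possible variant, sketched here on hand-built sample data, de-duplicates df1 first and uses a left merge so that the result keeps one row per df2 row. This is a swapped-in variation, not the answer's exact code, and the names merged and only_in_df2 are illustrative.

```python
import pandas as pd

# Hand-built stand-ins for the two CSV files, using the sample rows from the question.
df1 = pd.DataFrame({"Email": ["xxxx@abc.gov"] * 5})
df2 = pd.DataFrame(
    {"Email": ["xxxx@abc.gov"] * 4 + ["dddd@abc.com", "dddd@abc.com", "3333@abc.com"]}
)

# Dropping duplicates from df1 prevents each df2 row from being repeated
# once per matching df1 row; how="left" keeps every df2 row exactly once.
merged = df2.merge(df1.drop_duplicates(), how="left", on="Email", indicator=True)

# "_merge" is "left_only" for rows found in df2 but not in df1.
only_in_df2 = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_df2)
```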

answered May 05 '17 at 22:44

piRSquared

It keeps printing my full list of emails from df2. – davidjbeiler May 05 '17 at 23:09
You need to reassign the results back to `df2`. I did not overwrite your variable. Just do `df2 = df2[~df2.Email.isin(df1.Email)]` – piRSquared May 05 '17 at 23:11
I did that; it's still printing everything from my master list. It's not comparing :( – davidjbeiler May 05 '17 at 23:20
Then there is something wrong with your csv files such that the things you expect to be equal are in fact not equal. – piRSquared May 05 '17 at 23:21
Apparently pandas string comparison is case-sensitive :( – davidjbeiler May 08 '17 at 19:04
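
Since the mismatch turned out to be letter case, one way to make the comparison tolerant is to normalize both columns before calling isin. The sketch below assumes that differences in case and surrounding whitespace are the only problem. The sample rows and the normalize helper are hypothetical, not taken from the asker's files.

```python
import pandas as pd

# Hypothetical rows that differ only in case or trailing whitespace.
df1 = pd.DataFrame({"Email": ["XXXX@ABC.gov ", "yyyy@abc.gov"]})
df2 = pd.DataFrame({"Email": ["xxxx@abc.gov", "Yyyy@abc.gov", "dddd@abc.com"]})


def normalize(emails: pd.Series) -> pd.Series:
    """Strip surrounding whitespace and lower-case each email (illustrative helper)."""
    return emails.str.strip().str.lower()


# Exact comparison: nothing matches, so every df2 row is reported as "different".
strict = df2[~df2["Email"].isin(df1["Email"])]

# Normalized comparison: only the email that is genuinely missing from df1 remains.
only_in_df2 = df2[~normalize(df2["Email"]).isin(normalize(df1["Email"]))]
print(only_in_df2)
```

The original df2 values are left untouched; only the temporary keys used for the comparison are normalized.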

score 1 · Answer 2 · answered May 05 '17 at 22:52

Numpy solution:

In [311]: df2[~np.in1d(df2.Email, df1.Email)]
Out[311]:
          Email
4  dddd@abc.com
5  dddd@abc.com
6  3333@abc.com
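
On recent NumPy releases np.in1d is deprecated in favor of np.isin, so the same idea can be sketched as follows. The frames are hand-built from the question's sample data, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hand-built stand-ins for the two CSV files, using the sample rows from the question.
df1 = pd.DataFrame({"Email": ["xxxx@abc.gov"] * 5})
df2 = pd.DataFrame(
    {"Email": ["xxxx@abc.gov"] * 4 + ["dddd@abc.com", "dddd@abc.com", "3333@abc.com"]}
)

# np.isin returns a boolean array with one entry per df2 email:
# True where that email occurs anywhere in df1.
mask = np.isin(df2["Email"], df1["Email"])

# Keep the rows whose email was not found in df1.
only_in_df2 = df2[~mask]
print(only_in_df2)
```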

answered May 05 '17 at 22:52

MaxU - stand with Ukraine

Doesn't work; it's printing everything from my dataframe2, not the differences. – davidjbeiler May 05 '17 at 22:54
@davidjbeiler, what do you mean? You read it in your code: `df2 = pd.read_csv(AllUsers, usecols=[2])`... – MaxU - stand with Ukraine May 05 '17 at 22:56
It's printing everything from my dataframe2, not the differences. – davidjbeiler May 05 '17 at 22:57
