1

I have 2 dataframes (df_a and df_b) with 2 columns: 'Animal' and 'Name'.

In the bigger dataframe, there are more animals of the same type than the other. How do I find the extra animals of the same type by name? i.e. (df_a - df_b)

Dataframe A

Animal  Name
dog     john
dog     henry
dog     betty
dog     smith
cat     charlie
fish    tango
lion    foxtrot
lion    lima

Dataframe B

Animal  Name
dog     john
cat     charlie
dog     betty
fish    tango
lion    foxtrot
dog     smith

In this case, the extra would be:

Animal  Name
dog     henry
lion    lima

Attempt: I tried using

df_c = df_a.subtract(df_b, axis='columns')

but got the following error "unsupported operand type(s) for -: 'unicode' and 'unicode'", which makes sense since they are strings not numbers. Is there any other way?

user44840
  • 311
  • 2
  • 9

2 Answers2

2

You are looking for a left_only merge.

merged = pd.merge(df_a,df_b, how='outer', indicator=True)
merged.loc[merged['_merge'] == 'left_only'][['Animal', 'Name']]

Output

    Animal  Name
1   dog    henry
7   lion    lima

Explanation:

merged = pd.merge(df_a,df_b, how='outer', indicator=True)

Gives:

  Animal    Name    _merge
0   dog     john    both
1   dog     henry   left_only
2   dog     betty   both
3   dog     smith   both
4   cat     charlie both
5   fish    tango   both
6   lion    foxtrot both
7   lion    lima    left_only

The extra animals are in df_a only, which is denoted by left_only.

harvpan
  • 8,571
  • 2
  • 18
  • 36
  • I've been told that friends don't let friends chain calls. You might want to change this to `.loc` like `merged.loc[merged['_merge'] == 'left_only', ['Animal', 'Name']]`. It's also important to note the flexibility of this answer where you can use `'right_only'` and `'both'` to get values from only `df_b` or that are in both, respectively. – Eric Ed Lohmar Jul 31 '18 at 19:36
  • I did not get your `"friends don't let friends chain calls"` bit. What do you mean? If you think the answer is flexible, you can upvote :( @EricEdLohmar – harvpan Jul 31 '18 at 19:43
  • It's not usually good practice to refer to a part of a dataframe like this `df['column']['row']` at least partly because it may or may not return a copy of the data, as opposed to a view of it. `.loc` and `.iloc` are supposed to be used to get data from dataframes when multiple dimensions need to be called. It's really a bigger issue when assigning and not querying, though. [reference link](https://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy) – Eric Ed Lohmar Jul 31 '18 at 19:52
1

Using isin

df1[~df1.sum(1).isin(df2.sum(1))]
Out[611]: 
  Animal   Name
1    dog  henry
7   lion   lima
BENY
  • 317,841
  • 20
  • 164
  • 234