
I have two pandas dataframes:

df1

jon,12,NewYork
jon,12,NewYork
james,14,LA

df2

jon,12,NewYork
james,14,LA

I want to compare them and get the difference, as below:

deltaDF

jon,12,NewYork

I tried pd.concat([df1, df2, df2], axis=0, sort=False).drop_duplicates(keep=False). This works fine when there are no duplicates, but it doesn't give the difference when one of the dataframes contains duplicates and the other has a single entry. I have also tried the solutions mentioned in Python Pandas - Find difference between two data frames, but they also return an empty dataframe in this case.
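For reference, a minimal reproduction of the failing attempt (default integer column names are assumed, since the frames above have no headers):

import pandas as pd

df1 = pd.DataFrame([['jon', 12, 'NewYork'],
                    ['jon', 12, 'NewYork'],
                    ['james', 14, 'LA']])
df2 = pd.DataFrame([['jon', 12, 'NewYork'],
                    ['james', 14, 'LA']])

print(pd.concat([df1, df2, df2], axis=0, sort=False).drop_duplicates(keep=False))
# Empty DataFrame: the surplus 'jon' row from df1 is dropped together with
# its duplicates, because keep=False removes every repeated row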

Similar questions

I think this is not a duplicate question, because the answer given to that question returns an empty dataframe for the above scenario.

Edit

People are saying that this is not possible. Could we do something like this instead:

Add a column giving the occurrence count of each row.

Convert the above df1 to

jon,12,NewYork,2
james,14,LA,1

Convert the above df2 to

jon,12,NewYork,1
james,14,LA,1

Now I can use all the original columns as the index and subtract the last (count) column.
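For illustration, a minimal sketch of this counting idea (the column names name, age, and city are assumed for readability):

import pandas as pd

df1 = pd.DataFrame({'name': ['jon', 'jon', 'james'],
                    'age': [12, 12, 14],
                    'city': ['NewYork', 'NewYork', 'LA']})
df2 = pd.DataFrame({'name': ['jon', 'james'],
                    'age': [12, 14],
                    'city': ['NewYork', 'LA']})

# occurrence count of each distinct row
a = df1.groupby(['name', 'age', 'city']).size()
b = df2.groupby(['name', 'age', 'city']).size()

# subtract the counts; rows absent from df2 count as 0
surplus = a.sub(b, fill_value=0)
surplus = surplus[surplus > 0].astype(int)

# repeat each surplus row as many times as it is left over
deltaDF = surplus.index.repeat(surplus).to_frame(index=False)
print(deltaDF)
#   name  age     city
# 0  jon   12  NewYork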

  • How do your data frames actually look while executing? Could you please share that? – The Guy Jan 17 '20 at 11:12
  • When you want to get the difference, you have to compare something. But what do you want to compare if there is a duplicate and you only want to get rid of one? It won't work the way you want it to. – luigigi Jan 17 '20 at 11:13
  • @luigigi In pyspark we have a method, exceptAll(), to achieve this. I want something similar in pandas. – user0204 Jan 17 '20 at 11:15
  • @TheGuy I have shared dataframes in the question. Please let me know if you need anything else – user0204 Jan 17 '20 at 11:15

2 Answers


You can add a new column to catch the duplicates:

# number each row's occurrences so duplicates become distinguishable
df1['merge'] = df1.groupby(list(df1.columns)).cumcount()
df2['merge'] = df2.groupby(list(df2.columns)).cumcount()

pd.concat([df1, df2]).drop_duplicates(keep=False)

Afterwards you can drop the added column again.
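For example, a minimal sketch of the full flow (deltaDF is just an illustrative name, not from the original answer):

deltaDF = pd.concat([df1, df2]).drop_duplicates(keep=False)
deltaDF = deltaDF.drop(columns='merge')  # drop the helper column again
print(deltaDF)
# jon,12,NewYork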

  • Perfect!! I was doing this: `a = df1.groupby(list(df1.columns.values)).size()`, `b = df2.groupby(list(df2.columns.values)).size()`, `x = a - b`, `x = x.reset_index(name='count')`, `print(x[x['count'] > 0])`. But yours seems better. Thanks for the help; cumcount is great. – user0204 Jan 17 '20 at 12:55
import pandas as pd

df1 = pd.DataFrame({
    'name': ('jon', 'jon', 'james'),
    'age': (12, 12, 14),
    'city': ('NewYork', 'NewYork', 'LA'),
})

df2 = pd.DataFrame({
    'name': ('jon', 'james'),
    'age': (12, 14),
    'city': ('NewYork', 'LA'),
})


def unique_col(x):
    # generate record uniqueness criteria after grouping
    return ''.join([x['name'], str(x['age']), x['city'], str(x['counter'])])


# number repeated occurrences of each row (1, 2, ... within duplicates)
df1['counter'] = df1.groupby(['name', 'age', 'city']).cumcount() + 1
df2['counter'] = df2.groupby(['name', 'age', 'city']).cumcount() + 1
# unique key after grouped counters
df1['unique_key'] = df1.apply(unique_col, axis=1)
df2['unique_key'] = df2.apply(unique_col, axis=1)
# select items by unique key
df = df1[~df1['unique_key'].isin(df2['unique_key'].tolist())]
df = df.drop(columns=['counter', 'unique_key'])
print(df)

#   name  age     city
# 1  jon   12  NewYork

See comments. Hope this helps.
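A possible variant of the same idea (a sketch of my own, not part of the original answer): pandas' merge with indicator=True can replace the string key, assuming only the counter column has been added so far:

df = df1[['name', 'age', 'city', 'counter']].merge(
    df2[['name', 'age', 'city', 'counter']],
    how='left', indicator=True)
# keep rows that exist only in df1, then remove the helper columns
df = df[df['_merge'] == 'left_only'].drop(columns=['counter', '_merge'])
print(df)
#   name  age     city
# 1  jon   12  NewYork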
