
I have two pandas dataframes:

df1

jon,12,NewYork
jon,12,NewYork
james,14,LA

df2

jon,12,NewYork
james,14,LA

I want to compare them and get the difference, as below:

deltaDF

jon,12,NewYork

I tried pd.concat([df1, df2, df2], axis=0, sort=False).drop_duplicates(keep=False). This works fine when there are no duplicates, but it doesn't give the difference when one of the dataframes contains duplicates and the other has a single entry. I have also tried the solutions mentioned in Python Pandas - Find difference between two data frames, but they also return an empty dataframe in this case.
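For reference, a minimal reproduction of the failing attempt (default integer column names are assumed, since the frames above have no headers):

import pandas as pd

df1 = pd.DataFrame([['jon', 12, 'NewYork'],
                    ['jon', 12, 'NewYork'],
                    ['james', 14, 'LA']])
df2 = pd.DataFrame([['jon', 12, 'NewYork'],
                    ['james', 14, 'LA']])

print(pd.concat([df1, df2, df2], axis=0, sort=False).drop_duplicates(keep=False))
# Empty DataFrame: the surplus 'jon' row from df1 is dropped together with
# its duplicates, because keep=False removes every repeated row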

Similar questions

I think this is not a duplicate question, because the answer given to that question returns an empty dataframe for the above scenario.

Edit

People are saying that this is not possible. Could we do something like this instead:

Add a column giving the occurrence count of each row.

Convert the above df1 to

jon,12,NewYork,2
james,14,LA,1

Convert the above df2 to

jon,12,NewYork,1
james,14,LA,1

Now I can use all the original columns as the index and subtract the last (count) column.
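For illustration, a minimal sketch of this counting idea (the column names name, age, and city are assumed for readability):

import pandas as pd

df1 = pd.DataFrame({'name': ['jon', 'jon', 'james'],
                    'age': [12, 12, 14],
                    'city': ['NewYork', 'NewYork', 'LA']})
df2 = pd.DataFrame({'name': ['jon', 'james'],
                    'age': [12, 14],
                    'city': ['NewYork', 'LA']})

# occurrence count of each distinct row
a = df1.groupby(['name', 'age', 'city']).size()
b = df2.groupby(['name', 'age', 'city']).size()

# subtract the counts; rows absent from df2 count as 0
surplus = a.sub(b, fill_value=0)
surplus = surplus[surplus > 0].astype(int)

# repeat each surplus row as many times as it is left over
deltaDF = surplus.index.repeat(surplus).to_frame(index=False)
print(deltaDF)
#   name  age     city
# 0  jon   12  NewYork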

  • How do your data frames actually look while executing? Could you please share that? – The Guy Jan 17 '20 at 11:12
  • When you want to get the difference, you have to compare something. But what do you want to compare if there is a duplicate and you only want to get rid of one? It won't work the way you want it to. – luigigi Jan 17 '20 at 11:13
  • @luigigi In pyspark we have a method, exceptAll(), to achieve this. I want something similar in pandas. – user0204 Jan 17 '20 at 11:15
  • @TheGuy I have shared dataframes in the question. Please let me know if you need anything else – user0204 Jan 17 '20 at 11:15

2 Answers


You can add a new column to catch the duplicates:

# number each row's occurrences so duplicates become distinguishable
df1['merge'] = df1.groupby(list(df1.columns)).cumcount()
df2['merge'] = df2.groupby(list(df2.columns)).cumcount()

pd.concat([df1, df2]).drop_duplicates(keep=False)

Afterwards you can drop the added column again.
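For example, a minimal sketch of the full flow (deltaDF is just an illustrative name, not from the original answer):

deltaDF = pd.concat([df1, df2]).drop_duplicates(keep=False)
deltaDF = deltaDF.drop(columns='merge')  # drop the helper column again
print(deltaDF)
# jon,12,NewYork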

  • Perfect!! I was doing this: `a = df1.groupby(list(df1.columns.values)).size()`, `b = df2.groupby(list(df2.columns.values)).size()`, `x = a - b`, `x = x.reset_index(name='count')`, `print(x[x['count'] > 0])`. But yours seems better. Thanks for the help; cumcount is great. – user0204 Jan 17 '20 at 12:55
import pandas as pd

df1 = pd.DataFrame({
    'name': ('jon', 'jon', 'james'),
    'age': (12, 12, 14),
    'city': ('NewYork', 'NewYork', 'LA'),
})

df2 = pd.DataFrame({
    'name': ('jon', 'james'),
    'age': (12, 14),
    'city': ('NewYork', 'LA'),
})


def unique_col(x):
    # generate record uniqueness criteria after grouping
    return ''.join([x['name'], str(x['age']), x['city'], str(x['counter'])])


# number repeated occurrences of each row (1, 2, ... within duplicates)
df1['counter'] = df1.groupby(['name', 'age', 'city']).cumcount() + 1
df2['counter'] = df2.groupby(['name', 'age', 'city']).cumcount() + 1
# unique key after grouped counters
df1['unique_key'] = df1.apply(unique_col, axis=1)
df2['unique_key'] = df2.apply(unique_col, axis=1)
# select items by unique key
df = df1[~df1['unique_key'].isin(df2['unique_key'].tolist())]
df = df.drop(columns=['counter', 'unique_key'])
print(df)

#   name  age     city
# 1  jon   12  NewYork

See comments. Hope this helps.
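A possible variant of the same idea (a sketch of my own, not part of the original answer): pandas' merge with indicator=True can replace the string key, assuming only the counter column has been added so far:

df = df1[['name', 'age', 'city', 'counter']].merge(
    df2[['name', 'age', 'city', 'counter']],
    how='left', indicator=True)
# keep rows that exist only in df1, then remove the helper columns
df = df[df['_merge'] == 'left_only'].drop(columns=['counter', '_merge'])
print(df)
#   name  age     city
# 1  jon   12  NewYork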
