0

I have two data frames from a data export today and yesterday and want to find out if was any update on the data. The export varies and I cannot just perform a simple compare like with

df3 = pd.concat([df1,df2]).drop_duplicates(keep=False)

Because it will give me the unique lines from both df.

I need to find out if DF1 contains lines from DF2 that have different values. And I just want the rows from DF1 with the updates. The columns ID is the common identifier for rows in both DF. It is also unique in both df. But DF2 can contain more rows that DF1 and to make things worse the object type may differ as well. So for a proper compare both df should be converted to str only or the object type should be ignored for the compare.

data frame 1

id name last login last order total spent
1 Sean 1/1/2023 1/1/2023 432
2 John 1/1/2023 1/1/2023 98
3 Jim 1/1/2023 1/1/2023 123

data frame 2

id name last login last order total spent
1 Sean 1/1/2023 1/1/2023 432
2 John 1/6/2022 1/6/2022 2
3 Jim 1/1/2023 1/1/2023 123
4 Joe 7/8/2022 7/8/2022 1503

expected result:

id name last login last order total spent
2 John 1/1/2023 1/1/2023 98

I have tried to different ways to avoid looping thru the rows, but have not found a solution yet. Another hurdle is that pandas seems to keep changing the object type of columns while working with it(int become float, boolean become object), which make the compare even more difficult.

"Anti-merge" in pandas (Python) does not answer my question as it gives back unique lines from both data frames, not just df1. It is just a more complicated way for the method with concat I have in my question.

Empusas
  • 372
  • 2
  • 17

2 Answers2

1

One possibility might be to merge:

pd.merge(df1,df2,how = 'outer',indicator=True).loc[lambda x: x.pop('_merge').eq('left_only')]

Output:

   id  name last login last order  total spent
1   2  John   1/1/2023   1/1/2023           98
rhug123
  • 7,893
  • 1
  • 9
  • 24
  • This answer looks most promising, but it did not address the issue of different object types. I get a value error " You are trying to merge on float64 and object columns. If you wish to process you should you pd.concat" – Empusas Aug 01 '23 at 08:03
  • perhaps when you read in these df's you can specify their dtypes before hand with the `dtype` variable in `pd.read_csv()` https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html – rhug123 Aug 01 '23 at 13:36
  • yes, this is how I do it so far. The problem is that every time you copy a df it will change object types again. That is quite annoying. – Empusas Aug 02 '23 at 14:36
0

This is another solution

#Trim df2 down to match the id in df1
df2_trimmed = df2.loc[ [id for id in df2.index if id in df1.index] ]

#Find the indices where the df2 rows don't all match df1
#Then index into df1
mismatch = df1.loc[ (df2_trimmed == df1).all(axis=1) == False ]

enter image description here

some3128
  • 1,430
  • 1
  • 2
  • 8