What is the most efficient way to check which rows differ between two pandas dataframes?

Imagine we have the following pandas dataframes, df1 and df2.

import pandas as pd

df1 = pd.DataFrame([['a', 'b'], ['c', 'd'], ['e', 'f']], columns=['First', 'Last'])

df2 = pd.DataFrame([['a', 'b'], ['e', 'f'], ['g', 'h']], columns=['First', 'Last'])

In this case, row index 0 of df1 would be [a,b], row index 1 of df1 would be [c,d], and so on.

I want to know the most efficient way to find the row indices at which these dataframes differ.

In particular, although [e,f] appears in both dataframes, it is at index 2 in df1 and at index 1 in df2, and I want my outcome to reflect this.

Something like diff(df1, df2) = [1, 2].

I know I could loop through all the rows and check whether df1.loc[i,:] == df2.loc[i,:] for i in range(len(df1)), but is there a more efficient way?
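For reference, the loop described above can be sketched next to a vectorized equivalent. This is only a sketch: it assumes both dataframes have the same shape, the same column names, and the same default integer index, and it uses string values in place of the bare a, b, ... from the question. The names loop_diff and row_differs are illustrative. Note that NaN compares as unequal to NaN, so rows containing missing values would be reported as different.

```python
import pandas as pd

df1 = pd.DataFrame([["a", "b"], ["c", "d"], ["e", "f"]], columns=["First", "Last"])
df2 = pd.DataFrame([["a", "b"], ["e", "f"], ["g", "h"]], columns=["First", "Last"])

# Loop version: compare row i of df1 with row i of df2, one row at a time.
loop_diff = [
    i for i in range(len(df1))
    if not (df1.loc[i, :] == df2.loc[i, :]).all()
]

# Vectorized version: element-wise inequality, then flag any row
# in which at least one column differs.
row_differs = (df1 != df2).any(axis=1)
diff = df1.index[row_differs].tolist()

print(diff)  # [1, 2]
```

Both versions compare rows by position, so [e,f] sitting at index 2 in df1 and index 1 in df2 counts as a difference at both indices.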

  • Please repeat [on topic](https://stackoverflow.com/help/on-topic) and [how to ask](https://stackoverflow.com/help/how-to-ask) from the [intro tour](https://stackoverflow.com/tour). “Show me how to solve this coding problem” is not a Stack Overflow issue. We expect you to make an honest attempt, and *then* ask a *specific* question about your algorithm or technique. Stack Overflow is not intended to replace existing documentation and tutorials. – Prune Feb 10 '21 at 23:05
  • I did make an honest attempt. Read the bottom of the post: it literally suggests a method for solving this problem. I want to know if there is a more efficient way. Why don't you read the post in full before commenting and downvoting? – pablo_mathscobar Feb 10 '21 at 23:06
  • Got it -- I missed the phrase because you did not provide it as the expected [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) (MRE). Most of all, there is no recognition of vectorized PANDAS operations, which is a basic technique you pick up from tutorials, not from Stack Overflow. Please research how to apply conditions and searches to columns as a whole. – Prune Feb 10 '21 at 23:09
  • I am aware that you can do df1.apply(lambda x: ....); I just want to know what function would allow me to check the columns as a whole. – pablo_mathscobar Feb 10 '21 at 23:14
  • Again, manipulating whole columns (not `apply`, but the implied column syntax) is something to learn from tutorials. The brute-force way would be to merge the two DFs and then compare the desired columns, whose data values are now in the same row of a single DF. – Prune Feb 10 '21 at 23:16
  • perhaps https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.equals.html – Jonathan Leon Feb 11 '21 at 04:09
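The merge idea and the DataFrame.equals suggestion from the comments above might look roughly like this. It is a sketch under assumptions, not necessarily what the commenters had in mind: the two frames are joined on their shared index, and the suffixes _1 and _2 and the names merged and diff_rows are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame([["a", "b"], ["c", "d"], ["e", "f"]], columns=["First", "Last"])
df2 = pd.DataFrame([["a", "b"], ["e", "f"], ["g", "h"]], columns=["First", "Last"])

# Put both frames side by side, aligned on the index, so that the values
# to compare sit in the same row of a single dataframe.
merged = df1.join(df2, lsuffix="_1", rsuffix="_2")

# A row differs if either pair of columns disagrees.
differs = (merged["First_1"] != merged["First_2"]) | (
    merged["Last_1"] != merged["Last_2"]
)
diff_rows = merged.index[differs].tolist()

# DataFrame.equals only answers whether the two frames match as a whole;
# it does not say which rows differ.
frames_identical = df1.equals(df2)

print(diff_rows)
```

For frames that already share an index, the join adds little over comparing df1 and df2 directly; it is mainly useful when the rows need to be aligned on a key first.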

1 Answer

You may be looking for this:

df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)

From https://stackoverflow.com/a/57812527/15179457.
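A short run of the one-liner on the question's data, with strings standing in for a, b, ..., shows what it returns. One caveat worth checking against your use case: drop_duplicates compares row values only, not positions, so this finds rows that occur in just one of the two frames. The reordered frame df3 below is a hypothetical example added to illustrate that.

```python
import pandas as pd

df1 = pd.DataFrame([["a", "b"], ["c", "d"], ["e", "f"]], columns=["First", "Last"])
df2 = pd.DataFrame([["a", "b"], ["e", "f"], ["g", "h"]], columns=["First", "Last"])

# Rows whose values occur exactly once across both frames.
df_diff = pd.concat([df1, df2]).drop_duplicates(keep=False)
print(df_diff)  # the rows [c, d] (from df1) and [g, h] (from df2)

# Hypothetical check: df3 holds the same rows as df1 in reverse order.
df3 = df1.iloc[::-1].reset_index(drop=True)
reordered_diff = pd.concat([df1, df3]).drop_duplicates(keep=False)
print(reordered_diff.empty)  # True: a pure reordering is not detected
```

If the position of each row matters, as in the question, a position-wise comparison such as (df1 != df2).any(axis=1) is probably closer to what is being asked.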

LucasG0