How to compare two dataframe and print columns that are different in scala

Question

We have two data frames here:

the expected dataframe:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|   romin|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

and the actual data frame:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|  romino|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

the difference between the two dataframes now is:

+------+--------+--------+----------+-------+--------+
|emp_id|emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+--------+--------+----------+-------+--------+
|     4| sanjose|  romino|9848022331|  45123|SanRamon|
+------+--------+--------+----------+-------+--------+

We are using the except function df1.except(df2), however the problem with this is, it returns the entire rows that are different. What we want is to see which columns are different within that row (in this case, "romin" and "romino" from "emp_name" are different). We have been having tremendous difficulty with it and any help would be great.

Inner join and keep both emp_name and remove all rows where both are the same. — Alberto Bonsanto, Jun 03 '17 at 01:19
Can you make assumptions on the data? for example can you assume emp_id is unique? or even better must be the same and only validation on its data is relevant? otherwise, why is this row different in emp_name and not completely different than one of the other emp_id — Assaf Mendelson, Jun 03 '17 at 05:19
Similar question https://stackoverflow.com/questions/44807450/how-to-compare-two-dataframes-and-print-extra-rows-in-either-of-the-two-datafram/44810081#44810081 — Rahul Kanodiya, Jun 28 '17 at 18:52

score 45 · Accepted Answer · edited Jul 06 '22 at 08:48

45

From the scenario that is described in the above question, it looks like that difference has to be found between columns and not rows.

So, to do that we need to apply selective difference here, which will provide us the columns that have different values, along with the values.

Now, to apply selective difference we have to write code something like this:

First we need to find the columns in expected and actual data frames.

val columns = df1.schema.fields.map(_.name)
Then we have to find the difference columnwise.

val selectiveDifferences = columns.map(col => df1.select(col).except(df2.select(col)))
At last we need to find out which columns contain different values.

selectiveDifferences.map(diff => {if(diff.count > 0) diff.show})

And, we will get only the columns that contain different values. Like this:

+--------+
|emp_name|
+--------+
|  romino|
+--------+

I hope this helps!

edited Jul 06 '22 at 08:48

AAMCODE

415
1
7
20

answered Jun 03 '17 at 08:22

himanshuIIITian

5,985
6
50
70

1

This is perfect @himanshullTian. Thank you very much. I had the first two steps, but was missing that last key step! A follow up question is, what if there is an extra row in the actual dataframe? (expected has 4 rows and actual has 5). How do we distinguish that and print the entire row instead of printing every column out? – rominoushana Jun 05 '17 at 17:09
5

The Scala syntax is confusing me. Can anyone explain this in PySpark? – timbram Jun 18 '18 at 22:12
1

This is great @himanshullTian. I had to do something similar in Spark-Java to compare contents of two large csv files - I did not use the `columns.map` though - I looped through the array of csv headers. – ishmaelMakitla Dec 12 '18 at 06:32

vivek mishra · Answer 2 · 2020-05-21T16:30:47.460

1


list_col=[]
cols=df1.columns

# Prepare list of dataframes/per column
for col in cols:
  list_col.append(df1.select(col).subtract(df2.select(col)))

# Render/persist
for  l in list_col :
  if l.count() > 0 :
     l.show()

edited May 21 '20 at 16:30

answered May 21 '20 at 16:25

vivek mishra

1,162
8
16

1

So, this will give you the different rows for the columns in two dataframes? – whatsinthename Aug 19 '20 at 13:10
1

it will iterate through for each column and give list of all columns which differs in values across all rows. – vivek mishra Aug 21 '20 at 16:16

score 0 · Answer 3 · answered May 23 '22 at 16:57

0

Spark-extensions have an API for this - DIFF. I believe you can use it like this:

left.diff(right).show()

Or supply emp_id as an id column, like this:

left.diff(right, "emp_id").show()

This API is available for Spark 2.4.x - 3.x.

answered May 23 '22 at 16:57

Eyal

3,412
1
44
60

Hi, I am getting the below error: Py4JJavaError: An error occurred while calling None.uk.co.gresearch.spark.diff.DiffOptions. : java.lang.NoClassDefFoundError: scala/collection/StringOps$ I did `pip install pyspark-extension==2.6.0.3.3` My code below: `from gresearch.spark.diff import * df1.diff(df2).show()` – rams Apr 30 '23 at 01:07

How to compare two dataframe and print columns that are different in scala

3 Answers3

Linked