23

We have two data frames here:

the expected dataframe:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|   romin|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

and the actual data frame:

+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
|     3|  Chennai|  rahman|9848022330|  45000|SanRamon|
|     1|Hyderabad|     ram|9848022338|  50000|      SF|
|     2|Hyderabad|   robin|9848022339|  40000|      LA|
|     4|  sanjose|  romino|9848022331|  45123|SanRamon|
+------+---------+--------+----------+-------+--------+

the difference between the two dataframes now is:

+------+--------+--------+----------+-------+--------+
|emp_id|emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+--------+--------+----------+-------+--------+
|     4| sanjose|  romino|9848022331|  45123|SanRamon|
+------+--------+--------+----------+-------+--------+

We are using the except function df1.except(df2), however the problem with this is, it returns the entire rows that are different. What we want is to see which columns are different within that row (in this case, "romin" and "romino" from "emp_name" are different). We have been having tremendous difficulty with it and any help would be great.

user4157124
  • 2,809
  • 13
  • 27
  • 42
rominoushana
  • 451
  • 2
  • 4
  • 17
  • 1
    Inner join and keep both emp_name and remove all rows where both are the same. – Alberto Bonsanto Jun 03 '17 at 01:19
  • 1
    Can you make assumptions on the data? for example can you assume emp_id is unique? or even better must be the same and only validation on its data is relevant? otherwise, why is this row different in emp_name and not completely different than one of the other emp_id – Assaf Mendelson Jun 03 '17 at 05:19
  • 1
    Similar question https://stackoverflow.com/questions/44807450/how-to-compare-two-dataframes-and-print-extra-rows-in-either-of-the-two-datafram/44810081#44810081 – Rahul Kanodiya Jun 28 '17 at 18:52

3 Answers3

45

From the scenario that is described in the above question, it looks like that difference has to be found between columns and not rows.

So, to do that we need to apply selective difference here, which will provide us the columns that have different values, along with the values.

Now, to apply selective difference we have to write code something like this:

  1. First we need to find the columns in expected and actual data frames.

    val columns = df1.schema.fields.map(_.name)

  2. Then we have to find the difference columnwise.

    val selectiveDifferences = columns.map(col => df1.select(col).except(df2.select(col)))

  3. At last we need to find out which columns contain different values.

    selectiveDifferences.map(diff => {if(diff.count > 0) diff.show})

And, we will get only the columns that contain different values. Like this:

+--------+
|emp_name|
+--------+
|  romino|
+--------+

I hope this helps!

AAMCODE
  • 415
  • 1
  • 7
  • 20
himanshuIIITian
  • 5,985
  • 6
  • 50
  • 70
  • 1
    This is perfect @himanshullTian. Thank you very much. I had the first two steps, but was missing that last key step! A follow up question is, what if there is an extra row in the actual dataframe? (expected has 4 rows and actual has 5). How do we distinguish that and print the entire row instead of printing every column out? – rominoushana Jun 05 '17 at 17:09
  • 5
    The Scala syntax is confusing me. Can anyone explain this in PySpark? – timbram Jun 18 '18 at 22:12
  • 1
    This is great @himanshullTian. I had to do something similar in Spark-Java to compare contents of two large csv files - I did not use the `columns.map` though - I looped through the array of csv headers. – ishmaelMakitla Dec 12 '18 at 06:32
1

list_col=[]
cols=df1.columns

# Prepare list of dataframes/per column
for col in cols:
  list_col.append(df1.select(col).subtract(df2.select(col)))

# Render/persist
for  l in list_col :
  if l.count() > 0 :
     l.show()
vivek mishra
  • 1,162
  • 8
  • 16
0

Spark-extensions have an API for this - DIFF. I believe you can use it like this:

left.diff(right).show()

Or supply emp_id as an id column, like this:

left.diff(right, "emp_id").show()

This API is available for Spark 2.4.x - 3.x.

Eyal
  • 3,412
  • 1
  • 44
  • 60
  • Hi, I am getting the below error: Py4JJavaError: An error occurred while calling None.uk.co.gresearch.spark.diff.DiffOptions. : java.lang.NoClassDefFoundError: scala/collection/StringOps$ I did `pip install pyspark-extension==2.6.0.3.3` My code below: `from gresearch.spark.diff import * df1.diff(df2).show()` – rams Apr 30 '23 at 01:07