Spark - compare 2 dataframes without using hardcoded column names

Question

I need to create a way to compare 2 data frames and do not use any hardcoded data, so that I can upload at any time 2 files and compare them without changing anything.

df1
+--------------------+--------+----------------+----------+
|                  ID|colA.   |colB.           |colC      |
+--------------------+--------+----------------+----------+
|(122C8984ABF9F6EF...|       0|              10|     APPLE|
|(122C8984ABF9F6EF...|       0|              20|     APPLE|
|(122C8984ABF9F6EF...|       0|              10|    GOOGLE|
|(122C8984ABF9F6EF...|       0|              10|     APPLE|
|(122C8984ABF9F6EF...|       0|              15|   SAMSUNG|
|(122C8984ABF9F6EF...|       0|              10|     APPLE|
+--------------------+--------+----------------+----------+


df2
+--------------------+--------+----------------+----------+
|                  ID|colA.   |colB            |colC      |
+--------------------+--------+----------------+----------+
|(122C8984ABF9F6EF...|       0|              10|     APPLE|
|(122C8984ABF9F6EF...|       0|              20|     APPLE|
|(122C8984ABF9F6EF...|       0|              10|     APPLE|
|(122C8984ABF9F6EF...|       0|              30|     APPLE|
|(122C8984ABF9F6EF...|       0|              15|   SAMSUNG|
|(122C8984ABF9F6EF...|       0|              15|    GOOGLE|
+--------------------+--------+----------------+----------+

I need to compare these 2 data frames and count the differences from each column. My output should look like this:|

+--------------+-------------+-----------------+------------+------+
|Attribute Name|Total Records|Number Miss Match|% Miss Match|Status|
+--------------+-------------+-----------------+------------+------+
|      colA|            6|                0|       0.0 %|  Pass|
colB.       |           6|                3|        50 %|  Fail|
|    colC. |            6|                2|      33.3 %|  Fail||
+--------------+-------------+-----------------+------------+------+

I know how to compare the columns when using hardcoded column names , by my requirement is to compare it dynamically. What I did so far was to select a column from each data frame, but this doesn't seem the right way to do it.

 val columnsAll = df1.columns.map(m=>col(m))
 val df1_col1 = df1.select(df1.columns.slice(1,2).map(m=>col(m)):_*).as("Col1")
 val df2_col1 = df2.select(df2.columns.slice(1,2).map(m=>col(m)):_*).as("Col2")

Total Records is the number that is in common from both dataframes, in this case is 8. I compared both dataframes in number of rows, as "inner" and the total is 8. Now I need to compare each column and if there are differences, tio count them and I don't know how to compare the columns without using their names. — Dan, Feb 14 '22 at 08:20

score 0 · Answer 1 · answered Feb 13 '22 at 03:07

Two ways here:

The spark-fast-tests library has two methods for making DataFrame comparisons:

The assertSmallDataFrameEquality method collects DataFrames on the driver node and makes the comparison

def assertSmallDataFrameEquality(actualDF: DataFrame, expectedDF: DataFrame): Unit = {
  if (!actualDF.schema.equals(expectedDF.schema)) {
    throw new DataFrameSchemaMismatch(schemaMismatchMessage(actualDF, expectedDF))
  }
  if (!actualDF.collect().sameElements(expectedDF.collect())) {
    throw new DataFrameContentMismatch(contentMismatchMessage(actualDF, expectedDF))
  }
}

Using df2.except(df1)

For details of these 2 ways, you can refer DataFrame equality in Apache Spark Compare two Spark dataframes

Thank you, that's a good starting point. – Dan Feb 14 '22 at 08:21 — Dan, Feb 14 '22 at 08:21

Spark - compare 2 dataframes without using hardcoded column names

1 Answers1