0
I need to create a way to compare 2 data frames and do not use any hardcoded data, so that I can upload at any time 2 files and compare them without changing anything.

df1
+--------------------+--------+----------------+----------+
|                  ID|colA.   |colB.           |colC      |
+--------------------+--------+----------------+----------+
|(122C8984ABF9F6EF...|       0|              10|     APPLE|
|(122C8984ABF9F6EF...|       0|              20|     APPLE|
|(122C8984ABF9F6EF...|       0|              10|    GOOGLE|
|(122C8984ABF9F6EF...|       0|              10|     APPLE|
|(122C8984ABF9F6EF...|       0|              15|   SAMSUNG|
|(122C8984ABF9F6EF...|       0|              10|     APPLE|
+--------------------+--------+----------------+----------+


df2
+--------------------+--------+----------------+----------+
|                  ID|colA.   |colB            |colC      |
+--------------------+--------+----------------+----------+
|(122C8984ABF9F6EF...|       0|              10|     APPLE|
|(122C8984ABF9F6EF...|       0|              20|     APPLE|
|(122C8984ABF9F6EF...|       0|              10|     APPLE|
|(122C8984ABF9F6EF...|       0|              30|     APPLE|
|(122C8984ABF9F6EF...|       0|              15|   SAMSUNG|
|(122C8984ABF9F6EF...|       0|              15|    GOOGLE|
+--------------------+--------+----------------+----------+

I need to compare these 2 data frames and count the differences from each column. My output should look like this:|

+--------------+-------------+-----------------+------------+------+
|Attribute Name|Total Records|Number Miss Match|% Miss Match|Status|
+--------------+-------------+-----------------+------------+------+
|      colA|            6|                0|       0.0 %|  Pass|
colB.       |           6|                3|        50 %|  Fail|
|    colC. |            6|                2|      33.3 %|  Fail||
+--------------+-------------+-----------------+------------+------+

I know how to compare the columns when using hardcoded column names , by my requirement is to compare it dynamically. What I did so far was to select a column from each data frame, but this doesn't seem the right way to do it.

 val columnsAll = df1.columns.map(m=>col(m))
 val df1_col1 = df1.select(df1.columns.slice(1,2).map(m=>col(m)):_*).as("Col1")
 val df2_col1 = df2.select(df2.columns.slice(1,2).map(m=>col(m)):_*).as("Col2")
Dan
  • 49
  • 5
  • Total number of records is what when you have 2 dataframes? – thebluephantom Feb 11 '22 at 16:30
  • Total Records is the number that is in common from both dataframes, in this case is 8. I compared both dataframes in number of rows, as "inner" and the total is 8. Now I need to compare each column and if there are differences, tio count them and I don't know how to compare the columns without using their names. – Dan Feb 14 '22 at 08:20
  • Well is therea positional depency, the key to be compared? – thebluephantom Feb 15 '22 at 08:46
  • The only column name that can be used is the ID column. – Dan Feb 16 '22 at 09:24

1 Answers1

0

Two ways here:

  1. The spark-fast-tests library has two methods for making DataFrame comparisons:

The assertSmallDataFrameEquality method collects DataFrames on the driver node and makes the comparison

def assertSmallDataFrameEquality(actualDF: DataFrame, expectedDF: DataFrame): Unit = {
  if (!actualDF.schema.equals(expectedDF.schema)) {
    throw new DataFrameSchemaMismatch(schemaMismatchMessage(actualDF, expectedDF))
  }
  if (!actualDF.collect().sameElements(expectedDF.collect())) {
    throw new DataFrameContentMismatch(contentMismatchMessage(actualDF, expectedDF))
  }
}
  1. Using df2.except(df1)

For details of these 2 ways, you can refer DataFrame equality in Apache Spark Compare two Spark dataframes

Warren Zhu
  • 1,355
  • 11
  • 12