I need to create a way to compare 2 data frames and do not use any hardcoded data, so that I can upload at any time 2 files and compare them without changing anything.
df1
+--------------------+--------+----------------+----------+
| ID|colA. |colB. |colC |
+--------------------+--------+----------------+----------+
|(122C8984ABF9F6EF...| 0| 10| APPLE|
|(122C8984ABF9F6EF...| 0| 20| APPLE|
|(122C8984ABF9F6EF...| 0| 10| GOOGLE|
|(122C8984ABF9F6EF...| 0| 10| APPLE|
|(122C8984ABF9F6EF...| 0| 15| SAMSUNG|
|(122C8984ABF9F6EF...| 0| 10| APPLE|
+--------------------+--------+----------------+----------+
df2
+--------------------+--------+----------------+----------+
| ID|colA. |colB |colC |
+--------------------+--------+----------------+----------+
|(122C8984ABF9F6EF...| 0| 10| APPLE|
|(122C8984ABF9F6EF...| 0| 20| APPLE|
|(122C8984ABF9F6EF...| 0| 10| APPLE|
|(122C8984ABF9F6EF...| 0| 30| APPLE|
|(122C8984ABF9F6EF...| 0| 15| SAMSUNG|
|(122C8984ABF9F6EF...| 0| 15| GOOGLE|
+--------------------+--------+----------------+----------+
I need to compare these 2 data frames and count the differences from each column. My output should look like this:|
+--------------+-------------+-----------------+------------+------+
|Attribute Name|Total Records|Number Miss Match|% Miss Match|Status|
+--------------+-------------+-----------------+------------+------+
| colA| 6| 0| 0.0 %| Pass|
colB. | 6| 3| 50 %| Fail|
| colC. | 6| 2| 33.3 %| Fail||
+--------------+-------------+-----------------+------------+------+
I know how to compare the columns when using hardcoded column names , by my requirement is to compare it dynamically. What I did so far was to select a column from each data frame, but this doesn't seem the right way to do it.
val columnsAll = df1.columns.map(m=>col(m))
val df1_col1 = df1.select(df1.columns.slice(1,2).map(m=>col(m)):_*).as("Col1")
val df2_col1 = df2.select(df2.columns.slice(1,2).map(m=>col(m)):_*).as("Col2")