Comparing two data frames with different numbers of columns in Scala

Question

I have two data frames, df1 and df2.

df1 has 174 columns and df2 has 175 columns.

How can I find which column is extra?

You can refer to: https://stackoverflow.com/questions/44338412/how-to-compare-two-dataframe-and-print-columns-that-are-different-in-scala — Karthikeyan Rasipalay Durairaj, Dec 28 '21 at 15:31
Thanks for the solution, but in my case the two data frames have different numbers of columns. — Yogesh, Dec 28 '21 at 17:07

Alex Ott · Answer 1 · 2021-12-29T09:34:40.177

Just convert column lists into sets, and use diff operations on these sets, like this:

df2.columns.toSet.diff(df1.columns.toSet)

Note that the order of the operands matters: df1.columns.toSet.diff(df2.columns.toSet) returns the columns that are in df1 but not in df2, so it won't find the extra column in df2. If you want a result that doesn't depend on which data frame has the extra column, take the symmetric difference, like this:

df2.columns.toSet.diff(df1.columns.toSet).union(
  df1.columns.toSet.diff(df2.columns.toSet))
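
To make the effect of operand order concrete, here is a minimal sketch of the same set logic in Python, the language the other answer uses. It works on plain lists of names rather than real data frames, and the column names are hypothetical stand-ins for df1.columns and df2.columns.

```python
# Hypothetical column names standing in for df1.columns and df2.columns.
df1_columns = ["id", "name", "amount"]
df2_columns = ["id", "name", "amount", "created_at"]

# One-sided difference: names in df2 that df1 lacks.
only_in_df2 = set(df2_columns) - set(df1_columns)

# Swapping the operands answers a different question: names in df1 that df2 lacks.
only_in_df1 = set(df1_columns) - set(df2_columns)

# Symmetric difference: names present in exactly one of the two lists.
in_exactly_one = set(df1_columns) ^ set(df2_columns)

print(only_in_df2)     # {'created_at'}
print(only_in_df1)     # set()
print(in_exactly_one)  # {'created_at'}
```

The symmetric difference tells you that a name is unmatched but not which side it came from, so the two one-sided differences are the more useful ones when you need to know which data frame holds the extra column.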

score 0 · Answer 2 · answered Dec 28 '21 at 17:48

In PySpark, you can use the logic below.

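from pyspark.sql import SparkSession

# Reuse the active session (e.g. in a notebook or shell) or create one.
spark = SparkSession.builder.getOrCreate()

# Sample data: dept1 has one more column ("extracol") than dept.
dept = [("Finance", 10),
        ("Marketing", 20),
        ("Sales", 30),
        ("IT", 40)]
deptColumns = ["dept_name", "dept_id"]

dept1 = [("Finance", 10, '999'),
         ("Marketing", 20, '999'),
         ("Sales", 30, '999'),
         ("IT", 40, '999')]
deptColumns1 = ["dept_name", "dept_id", "extracol"]

deptDF = spark.createDataFrame(data=dept, schema=deptColumns)
dept1DF = spark.createDataFrame(data=dept1, schema=deptColumns1)
deptDF_columns = deptDF.schema.names
dept1DF_columns = dept1DF.schema.names

# Collect the columns of dept1DF that are missing from deptDF.
list_difference = []
for item in dept1DF_columns:
    if item not in deptDF_columns:
        list_difference.append(item)

print(list_difference)  # ['extracol']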
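
The loop above only checks one direction (columns of dept1DF that deptDF lacks). A minimal sketch of a two-way check that keeps the original column order might look like the following. The helper name column_diff and its dictionary keys are hypothetical, and it takes plain lists of names such as those returned by schema.names, so it runs without a Spark session.

```python
def column_diff(left_columns, right_columns):
    """Return the names unique to each side, keeping their original order."""
    left_set, right_set = set(left_columns), set(right_columns)
    return {
        "only_in_left": [c for c in left_columns if c not in right_set],
        "only_in_right": [c for c in right_columns if c not in left_set],
    }


# Same column names as deptColumns and deptColumns1 in the answer's sample data.
result = column_diff(["dept_name", "dept_id"],
                     ["dept_name", "dept_id", "extracol"])
print(result)  # {'only_in_left': [], 'only_in_right': ['extracol']}
```

Reporting each side separately shows which data frame holds the extra column, and building the sets once avoids rescanning a list for every name, which matters little at 175 columns but keeps the lookup cost flat.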

Tested code:

You haven't considered the scenario where `deptDF_columns` has an extra column. `list_difference = set(deptDF_columns) ^ set(dept1DF_columns)` should give you the difference between the two lists. — Syed Shahzer, Dec 28 '21 at 18:01
Thanks for checking. I have considered that scenario too. Can you please check line 13 in the screenshot? — Karthikeyan Rasipalay Durairaj, Dec 28 '21 at 18:18
