Compare two rows in a dataframe in Spark and to remove the row if 90 percent of the columns matches(if there are 10 columns and if 9 matches). How to do this?
Name Country City Married Salary
Tony India Delhi Yes 30000
Carol USA Chicago Yes 35000
Shuaib France Paris No 25000
Dimitris Spain Madrid No 28000
Richard Italy Milan Yes 32000
Adam Portugal Lisbon Yes 36000
Tony India Delhi Yes 22000 <--
Carol USA Chicago Yes 21000 <--
Shuaib France Paris No 20000 <--
Have to remove the marked rows since 90 percent that 4 out of 5 column values are matching with already existing rows.How to do this in Pyspark Dataframe.TIA