Remove duplicates in Spark with 90 percent column match

Asked Jul 05 '20 at 17:47

Active Jul 10 '20 at 06:57

Viewed 103 times

I want to compare rows in a Spark DataFrame and remove a row if 90 percent of its columns match another row (for example, if there are 10 columns and 9 of them match). How can I do this?

Name      Country   City     Married  Salary
Tony      India     Delhi    Yes      30000
Carol     USA       Chicago  Yes      35000
Shuaib    France    Paris    No       25000
Dimitris  Spain     Madrid   No       28000
Richard   Italy     Milan    Yes      32000
Adam      Portugal  Lisbon   Yes      36000
Tony      India     Delhi    Yes      22000  <--
Carol     USA       Chicago  Yes      21000  <--
Shuaib    France    Paris    No       20000  <--

The marked rows have to be removed because all but one of their column values (4 out of 5 here) match an already existing row. How can I do this with a PySpark DataFrame? TIA
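To pin down the rule being asked for, here is a minimal plain-Python sketch of it on the sample rows, not a Spark implementation. The names `count_matches` and `drop_near_duplicates` are made up for illustration, and it assumes "already existing" means "appears earlier in the data and was itself kept". The threshold is passed as a whole-number percentage so the comparison stays in integers.

```python
from typing import List, Sequence


def count_matches(a: Sequence, b: Sequence) -> int:
    """Number of positions at which two equal-length rows hold the same value."""
    return sum(1 for x, y in zip(a, b) if x == y)


def drop_near_duplicates(rows: List[Sequence], min_percent: int) -> List[Sequence]:
    """Keep a row only if it matches no previously kept row in at least
    min_percent of its columns (hypothetical helper, first occurrence wins)."""
    kept: List[Sequence] = []
    for row in rows:
        is_duplicate = any(
            # Integer form of: matches / number_of_columns >= min_percent / 100
            count_matches(row, earlier) * 100 >= min_percent * len(row)
            for earlier in kept
        )
        if not is_duplicate:
            kept.append(row)
    return kept


rows = [
    ("Tony", "India", "Delhi", "Yes", 30000),
    ("Carol", "USA", "Chicago", "Yes", 35000),
    ("Shuaib", "France", "Paris", "No", 25000),
    ("Dimitris", "Spain", "Madrid", "No", 28000),
    ("Richard", "Italy", "Milan", "Yes", 32000),
    ("Adam", "Portugal", "Lisbon", "Yes", 36000),
    ("Tony", "India", "Delhi", "Yes", 22000),
    ("Carol", "USA", "Chicago", "Yes", 21000),
    ("Shuaib", "France", "Paris", "No", 20000),
]

# 4 of 5 columns is 80 percent, so 80 is the threshold that fits this sample.
result = drop_near_duplicates(rows, min_percent=80)
for row in result:
    print(row)
```

With `min_percent=90` nothing would be dropped from this five-column sample, since 4 of 5 is only 80 percent; the 90 percent figure fits the 10-column case. The pairwise loop is quadratic, so on a large DataFrame the same rule would presumably need a self-join on a row id (or some blocking key) rather than collecting rows to the driver.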

edited Jul 10 '20 at 06:57


MithunK07


Welcome to SO. It is always recommended to add details such as a sample dataset and the error you are facing so that others can reproduce your problem. You can refer to https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples . It will help you ask questions more effectively. – Shantanu Sharma Jul 05 '20 at 18:37
Should the comparison be between just two consecutive rows defined by an order, or across all the rows in a group? – Raghu Jul 05 '20 at 19:18


0 Answers