Get duplicated rows based on a column in a Spark DataFrame

Question

I am trying to drop duplicated rows based on the column id. How can I get the dropped rows, i.e. the ones that have a duplicate "id"? This is the code I have so far.

val datatoBeInserted = data.select("id", "is_enabled", "code", "description", "gamme", "import_local", "marque", "type_marketing", "reference", "struct", "type_tarif", "family_id", "range_id", "article_type_id")
val cleanedData = datatoBeInserted.dropDuplicates("id")

With the above code, cleanedData contains all rows without duplicates of "id". Now I want to find out which rows were filtered out as duplicates.

@RameshMaharjan Of this: https://stackoverflow.com/questions/29537564/spark-subtract-two-dataframes — philantrovert, Aug 30 '17 at 08:55

score 2 · Answer 1 · answered Aug 30 '17 at 08:58

You can use the code below to find the rows that were dropped:

val datatoBeInserted = data.select("id", "is_enabled", "code", "description", "gamme", "import_local", "marque", "type_marketing", "reference", "struct", "type_tarif", "family_id", "range_id", "article_type_id")

val cleanedData = datatoBeInserted.dropDuplicates("id")

val droppedData = datatoBeInserted.except(cleanedData)
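One thing worth checking with this approach: except compares whole rows and returns distinct results, so a dropped row that is identical in every column to the row that was kept will not show up in droppedData. The sketch below illustrates this on made-up sample data; the object name, the helper droppedViaExcept and the two-column schema are assumptions for illustration, not part of the original code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExceptCaveat {
  // Hypothetical helper wrapping the dropDuplicates + except steps.
  def droppedViaExcept(df: DataFrame, key: String): DataFrame = {
    val cleaned = df.dropDuplicates(key)
    // Set difference on entire rows (EXCEPT DISTINCT semantics).
    df.except(cleaned)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("except-caveat")
      .getOrCreate()
    import spark.implicits._

    // Made-up data: id 1 has two different rows, id 2 has two identical rows.
    val datatoBeInserted =
      Seq((1, "a"), (1, "b"), (2, "c"), (2, "c")).toDF("id", "code")

    // Only the id = 1 row that was not kept is returned. The second (2, "c")
    // row is indistinguishable from the kept one, so except removes it too.
    droppedViaExcept(datatoBeInserted, "id").show()

    spark.stop()
  }
}
```

If exact duplicates also need to be reported, a different technique (for example numbering the rows per id) would be needed.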

All the best :)

answered Aug 30 '17 at 08:58

maxmithun

Thanks, I have already tried it, but it takes too much time on huge data. Is there any other solution? – Maher HTB Aug 30 '17 at 10:24
@MaherHTB The problem is that when checking whether there's a row with the same id in cleanedData, that row could be anywhere. So the operation has to do an enormous amount of data shuffling => slow – The Archetypal Paul Aug 31 '17 at 19:39
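A possible alternative to the dropDuplicates + except combination is to number the rows within each id using a window function, then split on that number. This is only a sketch under assumptions: the helper name splitByKey, the temporary column "_rn" and the sample data are made up, and the window still shuffles the data by id once, so whether it is actually faster on a given dataset would have to be measured.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object SplitDuplicates {
  // Hypothetical helper: returns (kept, dropped) for the given key column.
  // Assumes the input has no column named "_rn".
  def splitByKey(df: DataFrame, key: String): (DataFrame, DataFrame) = {
    // row_number needs an ordered window. Ordering by the key itself is
    // arbitrary, so which row counts as "first" is not deterministic,
    // much like dropDuplicates.
    val w = Window.partitionBy(col(key)).orderBy(col(key))
    // Cache to avoid recomputing the window for each of the two filters.
    val numbered = df.withColumn("_rn", row_number().over(w)).cache()
    val kept = numbered.filter(col("_rn") === 1).drop("_rn")
    val dropped = numbered.filter(col("_rn") > 1).drop("_rn")
    (kept, dropped)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("split-duplicates")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data with duplicate ids 1 and 2.
    val data =
      Seq((1, "a"), (1, "b"), (2, "c"), (2, "c"), (3, "d")).toDF("id", "code")

    val (cleanedData, droppedData) = splitByKey(data, "id")
    cleanedData.show()  // one row per id
    droppedData.show()  // every additional row, including exact duplicates

    spark.stop()
  }
}
```

Unlike except, this also reports dropped rows that are identical to the kept row in every column. Replacing the orderBy column with something meaningful (a timestamp, for instance) would make the choice of the kept row deterministic.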
