
I am trying to drop duplicated rows based on the column "id". How can I get the rows that were dropped because they had a duplicate "id"? This is the code I have so far:

val datatoBeInserted = data.select("id", "is_enabled", "code", "description", "gamme", "import_local", "marque", "type_marketing", "reference", "struct", "type_tarif", "family_id", "range_id", "article_type_id")
val cleanedData = datatoBeInserted.dropDuplicates("id")

With the code above, cleanedData contains exactly one row per "id". Now I want to find out which rows were filtered out as duplicates.

P̲̳x͓L̳
Maher HTB

1 Answer


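You can use the code below to find the rows that were dropped: it subtracts the deduplicated DataFrame from the original with except. Note that except is a set difference over whole rows, so duplicates that are identical in every column will not show up in the result.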

val datatoBeInserted = data.select("id", "is_enabled", "code", "description", "gamme", "import_local", "marque", "type_marketing", "reference", "struct", "type_tarif", "family_id", "range_id", "article_type_id")

val cleanedData = datatoBeInserted.dropDuplicates("id")

val droppedData = datatoBeInserted.except(cleanedData)

All the best :)

maxmithun
  • Thanks, I have already tried it, but it takes too much time on huge data. Is there any other solution? – Maher HTB Aug 30 '17 at 10:24
  • @MaherHTB The problem is that a row with the same id in cleanedData could be anywhere, so the operation has to do an enormous amount of data shuffling, which makes it slow. – The Archetypal Paul Aug 31 '17 at 19:39
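
Regarding the request for another solution: one possible alternative, sketched here and not benchmarked, is to number the rows within each "id" using a window function and split on that number. This yields the kept and the dropped rows from a single partitioning by "id", and it also reports duplicates that are identical in every column. The helper name splitDuplicates, the object name, the temporary column "rn", and the choice of "code" as the ordering column are assumptions made for illustration. The row kept here is the first one by the ordering column, which is not necessarily the row dropDuplicates would keep, so the returned "kept" DataFrame should be used in place of cleanedData.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object DroppedDuplicates {

  // Hypothetical helper: splits `df` into (kept, dropped).
  // For each value of `idCol`, the first row by `orderCol` is kept
  // and every other row with that id is returned as dropped.
  def splitDuplicates(df: DataFrame, idCol: String, orderCol: String): (DataFrame, DataFrame) = {
    val byId = Window.partitionBy(col(idCol)).orderBy(col(orderCol))

    // "rn" is a temporary column: 1 for the kept row, 2, 3, ... for the others.
    val ranked = df.withColumn("rn", row_number().over(byId))

    val kept = ranked.filter(col("rn") === 1).drop("rn")
    val dropped = ranked.filter(col("rn") > 1).drop("rn")
    (kept, dropped)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dropped-duplicates")
      .getOrCreate()
    import spark.implicits._

    // Small made-up sample; ids 1 and 3 each occur twice.
    val data = Seq((1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "d")).toDF("id", "code")

    val (kept, dropped) = splitDuplicates(data, "id", "code")
    kept.show()
    dropped.show()

    spark.stop()
  }
}
```

With the question's data, the call would look like splitDuplicates(datatoBeInserted, "id", "code"). If both results are used, caching the ranked DataFrame may avoid computing the window twice; whether this is faster than except on a given dataset would need to be measured.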