Scala/Spark drop duplicates based in other column specific value

Question

I want to drop duplicates with same ID that not have a specific value in other column (in this case filter by those rows that have same ID and value = 1) Input df:

+---+-----+------+
| id|value|sorted|
+---+-----+------+
|  3|    0|     2|
|  3|    1|     3|
|  4|    0|     6|
|  4|    1|     5|
|  5|    4|     6|
+---+-----+------+

Result I want:

+---+-----+------+
| id|value|sorted|
+---+-----+------+
|  3|    1|     3|
|  4|    1|     5|
|  5|    4|     6|
+---+-----+------+

@user6910411 No, because that user wants to drop duplicates based on the max value of other column, and in this case is by a specific value — dataProcs, Oct 03 '18 at 10:47

score 1 · Accepted Answer · answered Oct 03 '18 at 11:03

1

Can be done by getting rows where values is "1", and then left join with orignal data:

val df = List(
  (3, 0, 2),
  (3, 1, 3),
  (4, 0, 6),
  (4, 1, 5),
  (5, 4, 6)
).toDF("id", "value", "sorted")

val withOne = df.filter($"value" === 1)
val joinedWithOriginal = df.alias("orig").join(withOne.alias("one"), Seq("id"), "left")
val result = joinedWithOriginal
  .where($"one.value".isNull || $"one.value" === $"orig.value")
  .select("orig.id", "orig.value", "orig.sorted")

result.show(false)

Output:

+---+-----+------+
|id |value|sorted|
+---+-----+------+
|3  |1    |3     |
|4  |1    |5     |
|5  |4    |6     |
+---+-----+------+

answered Oct 03 '18 at 11:03

pasha701

6,831
1
15
22

That works! What if I want to remove those rows which have a null value in other column with same ID? – dataProcs Oct 03 '18 at 14:18
Original Dataframe "df" can be filtered with column ".isNull" function, before "val withOne" statement. – pasha701 Oct 04 '18 at 07:05

Scala/Spark drop duplicates based in other column specific value

1 Answers1