I have millions of rows as dataframe like this:
val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE")).toDF("id", "status")
scala> df.show(false)
+---+--------+
|id |status |
+---+--------+
|id1|ACTIVE |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE |
|id3|INACTIVE|
|id3|INACTIVE|
+---+--------+
Now I want to divide this data into three separate dataFrame like this:
- Only ACTIVE ids (like id2), say activeDF
- Only INACTIVE ids (like id3), say inactiveDF
- Having both ACTIVE and INACTIVE as status, say bothDF
How can I calculate activeDF and inactiveDF?
I know that bothDF can be calculated like
df.select("id").distinct.except(activeDF).except(inactiveDF)
, but this will involve shuffling (as 'distinct' operation required same). Is there any better way to calculate bothDF
Versions:
Spark : 2.2.1
Scala : 2.11