Below is a sample dataset representing each employee's in_date and out_date. I need to obtain the latest in_date for every employee.
Spark is running on a 4-node standalone cluster.
Initial Dataset:
EmployeeID   in_date      out_date
1111111      2017-04-20   2017-09-14
1111111      2017-11-02   null
2222222      2017-09-26   2017-09-26
2222222      2017-11-28   null
3333333      2016-01-07   2016-01-20
3333333      2017-10-25   null
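For reference, a minimal sketch of the pipeline I'm running (Scala; assumes df already holds the dataset above):

    import org.apache.spark.sql.functions.col

    // Sort by in_date descending, then keep one row per EmployeeID,
    // expecting the retained row to be the one with the latest in_date.
    val latest = df
      .sort(col("in_date").desc)
      .dropDuplicates("EmployeeID")
    latest.show()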
Dataset after df.sort(col("in_date").desc):
EmployeeID   in_date      out_date
1111111      2017-11-02   null
1111111      2017-04-20   2017-09-14
2222222      2017-09-26   2017-09-26
2222222      2017-11-28   null
3333333      2017-10-25   null
3333333      2016-01-07   2016-01-20
Output after df.dropDuplicates("EmployeeID"):
EmployeeID   in_date      out_date
1111111      2017-11-02   null
2222222      2017-09-26   2017-09-26
3333333      2016-01-07   2016-01-20
Expected Dataset:
EmployeeID   in_date      out_date
1111111      2017-11-02   null
2222222      2017-11-28   null
3333333      2017-10-25   null
However, when I sorted the initial dataset with sortWithinPartitions and then deduplicated, I got the expected dataset (see the sketch below).
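A minimal sketch of that working variant (same df as above):

    import org.apache.spark.sql.functions.col

    // Sorts rows within each partition only (no global ordering shuffle),
    // then deduplicates on EmployeeID.
    val latestWithin = df
      .sortWithinPartitions(col("in_date").desc)
      .dropDuplicates("EmployeeID")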
Am I missing anything here, big or small? Any help is appreciated.
Additional Information:
The expected output above was achieved when df.sort was executed with Spark in local mode.
I have not done any explicit partitioning or repartitioning.
The initial dataset is obtained from the underlying Cassandra database.
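For completeness, the load looks roughly like this (a sketch using the spark-cassandra-connector; the keyspace and table names are placeholders, not the real ones):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("employee-in-dates") // placeholder app name
      .getOrCreate()

    // "mykeyspace" and "employee_dates" are placeholder names.
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "mykeyspace", "table" -> "employee_dates"))
      .load()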