I have a `Dataset` in Java Spark related to the cabs of a city. Among its several columns, it has:

- `day`, in the form `2016-04-02`, which is the day that the cab picked up a customer.
- `vendor_id`, which is for example `1`.
- `hour`, in the form of `2` or `16`.
I want to get the hour in which each vendor, on each day, had the maximum number of customers. So I think I should groupBy on these three columns. Here are the first two rows after I groupBy on `day`, `vendor_id`, `hour`:
+----------+---------+----+-----+
|day |vendor_id|hour|count|
+----------+---------+----+-----+
|2016-01-01|1 |2 |116 |
|2016-01-01|1 |1 |110 |
+----------+---------+----+-----+
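For reference, the grouping step described above can be sketched as follows. This is a minimal, self-contained example: the `Trip` bean and the sample data are hypothetical stand-ins for the real cab Dataset; only the `groupBy(...).count()` call reflects the actual question.

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CabCounts {
    // Hypothetical bean standing in for one pickup record.
    public static class Trip implements Serializable {
        private String day;
        private int vendor_id;
        private int hour;

        public Trip() {}
        public Trip(String day, int vendor_id, int hour) {
            this.day = day;
            this.vendor_id = vendor_id;
            this.hour = hour;
        }
        public String getDay() { return day; }
        public void setDay(String day) { this.day = day; }
        public int getVendor_id() { return vendor_id; }
        public void setVendor_id(int vendor_id) { this.vendor_id = vendor_id; }
        public int getHour() { return hour; }
        public void setHour(int hour) { this.hour = hour; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cab-counts")
                .master("local[*]")
                .getOrCreate();

        // Made-up sample data: two pickups at hour 2, one at hour 1.
        Dataset<Row> cabs = spark.createDataFrame(Arrays.asList(
                new Trip("2016-01-01", 1, 2),
                new Trip("2016-01-01", 1, 2),
                new Trip("2016-01-01", 1, 1)
        ), Trip.class);

        // Count customers per (day, vendor_id, hour) -- the grouping
        // that produces a table like the one shown above.
        Dataset<Row> counts = cabs.groupBy("day", "vendor_id", "hour").count();
        counts.show();
    }
}
```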
How can I get the hour of each day of each vendor (the groups created by the groupBy) with the maximum count?
I have already seen that this is solved with a join, but that and other examples grouped on only one column, whereas here I grouped on three.
If possible, I would prefer Java code that uses the Spark libraries. Thank you for your time.