
I have a dataset with colors, counts, and their related dates.

+-----------+----------+-----+
|      color|      Date|count|
+-----------+----------+-----+
|        red|2014-05-26|    5|
|        red|2014-05-02|    1|
|        red|2015-04-02|    1|
|        red|2015-04-26|    1|
|        red|2015-09-26|    2|
|       blue|2014-05-26|    3|
|       blue|2014-06-02|    1|
|      brown|2014-07-31|    2|
|      green|2014-08-01|    2|
+-----------+----------+-----+

I want the max count for each color along with its related date. I am using Spark 2.0.2 with Java 8.

When I use the max function it drops the date column, and when I add date to the groupBy it just returns the same table as the input dataset.

df.groupBy("color").max("count").show();

+-----------+----------+
|color      |max(count)|
+-----------+----------+
|        red|         5|
|       blue|         3|
|      brown|         2|
|      green|         2|
+-----------+----------+

Expected output:

+-----------+----------+----------+
|color      |      date|max(count)|
+-----------+----------+----------+
|        red|2014-05-26|         5|
|       blue|2014-05-26|         3|
|      brown|2014-07-31|         2|
|      green|2014-08-01|         2|
+-----------+----------+----------+

2 Answers


The other answer is only partially correct: first() returns the first date value in each group produced by grouping on color.

In the question's data, the row with max(count) happens to be the first row for each color, which is why the statement:

df.groupBy("color").agg(first("date").alias("date"), max("count")).show();

returns correct results.

If your data were like this:

df.show
// red with highest count is at the bottom of the group
+-----+----------+-----+
|color|      date|count|
+-----+----------+-----+
|  red|2014-05-02|    1|
|  red|2015-04-02|    1|
|  red|2015-04-26|    1|
|  red|2015-09-26|    2|
|  red|2014-05-26|    5|
| blue|2014-05-26|    3|
| blue|2014-06-02|    1|
|brown|2014-07-31|    2|
|green|2014-08-01|    2|
+-----+----------+-----+

The query in question would return the first date in the group, which corresponds to a count of 1:

df.groupBy($"color").agg( first($"date").as("date") , max("count").as("count") ).show
+-----+----------+-----+
|color|      date|count|
+-----+----------+-----+
|brown|2014-07-31|    2|
|green|2014-08-01|    2|
| blue|2014-05-26|    3|
|  red|2014-05-02|    5|
+-----+----------+-----+

Alternatively, you can use window functions here:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val window = Window.partitionBy($"color").orderBy($"count".desc)

df.withColumn("rn", row_number over window).where($"rn" === 1 ).drop($"rn").show
+-----+----------+-----+
|color|      date|count|
+-----+----------+-----+
|brown|2014-07-31|    2|
|green|2014-08-01|    2|
| blue|2014-05-26|    3|
|  red|2014-05-26|    5|
+-----+----------+-----+
  • I need a solution in Java, not in Scala. –  Nov 15 '17 at 12:30
  • @dataScienctist This is just the method. We aren't here to write your code for you. You are supposed to take the answers as a reference and work out the rest yourself. Also, the Spark SQL API is quite similar for Java and Scala; if you try, you'll see. – philantrovert Nov 21 '17 at 09:31
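
As noted in the comments, the Java API mirrors the Scala one closely. Here is a rough Java 8 sketch of the window-function approach, an illustration only, assuming df is the question's Dataset&lt;Row&gt;; the variable names are made up:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// Rank the rows of each color by descending count, then keep only the top-ranked row.
WindowSpec window = Window.partitionBy(col("color")).orderBy(col("count").desc());

Dataset<Row> result = df
        .withColumn("rn", row_number().over(window)) // rn == 1 marks the max-count row per color
        .where(col("rn").equalTo(1))
        .drop("rn");

result.show();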

Here you want the related column along with the max value, so you can use the agg() function for the max value and first() for the related column.

df.groupBy("color").agg(first("date").alias("date"), max("count")).show();

+-----------+----------+----------+
|color      |      date|max(count)|
+-----------+----------+----------+
|        red|2014-05-26|         5|
|       blue|2014-05-26|         3|
|      brown|2014-07-31|         2|
|      green|2014-08-01|         2|
+-----------+----------+----------+