I want to get the most frequent occurring String in a row,in a given window, and have this value in a new row. (am using Pyspark)
This is what my table looks like.
window label value
123 a 54
123 a 45
123 a 21
123 b 99
123 b 78
I'm doing some aggregation, and at the moment I'm grouping by both window
and label
.
sqlContext.sql(SELECT avg(value) as avgValue FROM table GROUP BY window, label)
This returns the average where window = 123 and label = a and the average where window = 123 and label = b
What I am trying to do, is order label
by most frequently occurring string descending , so then in my sql statement I can do SELECT first(label) as majLabel, avg(value) as avgValue FROM table GROUP BY window
I'm trying to do this in a window function but am just not quite getting there.
group = ["window"]
w = (Window().partitionBy(*group))