I want to get the most frequently occurring string in a column, within a given window, and put this value in a new column. (I am using PySpark.)

This is what my table looks like.

window    label    value
123         a        54
123         a        45
123         a        21
123         b        99
123         b        78

I'm doing some aggregation, and at the moment I'm grouping by both window and label.

sqlContext.sql("SELECT avg(value) as avgValue FROM table GROUP BY window, label")

This returns the average where window = 123 and label = a, and the average where window = 123 and label = b.

What I am trying to do is order label by the most frequently occurring string, descending, so that in my SQL statement I can do SELECT first(label) as majLabel, avg(value) as avgValue FROM table GROUP BY window.
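For comparison, here is the intended result sketched in plain Python, without Spark, using the sample rows above. The names `rows` and `majority_label_and_avg` are invented for this illustration:

```python
from collections import Counter
from statistics import mean

# Sample rows from the question: (window, label, value)
rows = [
    (123, 'a', 54),
    (123, 'a', 45),
    (123, 'a', 21),
    (123, 'b', 99),
    (123, 'b', 78),
]

def majority_label_and_avg(rows):
    """For each window, return the most frequent label and the average value."""
    result = {}
    for win in {w for w, _, _ in rows}:
        group = [(lbl, val) for w, lbl, val in rows if w == win]
        counts = Counter(lbl for lbl, _ in group)
        maj_label = counts.most_common(1)[0][0]   # most frequent label in this window
        avg_value = mean(val for _, val in group)  # avg(value) over the whole window
        result[win] = (maj_label, avg_value)
    return result

print(majority_label_and_avg(rows))  # {123: ('a', 59.4)}
```

Label `a` occurs three times against two for `b`, so the majority label for window 123 is `a`, and the window-wide average is 59.4.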

I'm trying to do this in a window function but am just not quite getting there.

from pyspark.sql import Window

group = ["window"]

w = Window().partitionBy(*group)
other15
  • @zero323 this seems like something you would know how to approach? Given your expertise with Spark's window functions :D – other15 Jul 13 '16 at 12:50

1 Answer

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

df = spark.createDataFrame(
    [['123', 'a', '54'], ['123', 'a', '45'], ['123', 'a', '21'],
     ['123', 'b', '99'], ['123', 'b', '78']],
    ['window', 'label', 'value'])

Define the right WindowSpec:

win_spec = Window.partitionBy(['window', 'label']).orderBy(col('value').desc())

row_number() returns a sequential number starting at 1 within each ['window', 'label'] partition:

str_rank = df.withColumn('string_rank',row_number().over(win_spec))
str_rank.show()

This is how the df looks now:

+------+-----+-----+-----------+
|window|label|value|string_rank|
+------+-----+-----+-----------+
|   123|    a|   54|          1|
|   123|    a|   45|          2|
|   123|    a|   21|          3|
|   123|    b|   99|          1|
|   123|    b|   78|          2|
+------+-----+-----+-----------+

Select the rows where string_rank == 1 and drop the helper column:

str_rank.where(col('string_rank')==1).drop('string_rank').show()
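The rank-and-filter step above can be checked in plain Python, without a Spark session, using the same sample rows. The names `rows`, `keyfunc`, and `top_rows` are invented for this sketch:

```python
from itertools import groupby

# Same sample data as the answer's DataFrame: (window, label, value)
rows = [
    ('123', 'a', '54'),
    ('123', 'a', '45'),
    ('123', 'a', '21'),
    ('123', 'b', '99'),
    ('123', 'b', '78'),
]

# Partition by (window, label), order by value descending,
# then keep only the row with row_number == 1 in each partition.
keyfunc = lambda r: (r[0], r[1])
top_rows = []
for _, part in groupby(sorted(rows, key=keyfunc), key=keyfunc):
    ordered = sorted(part, key=lambda r: r[2], reverse=True)  # orderBy value desc
    top_rows.append(ordered[0])                               # string_rank == 1

print(top_rows)  # [('123', 'a', '54'), ('123', 'b', '99')]
```

This reproduces the two rows kept by the `string_rank == 1` filter: the highest value per (window, label) pair.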

Wong Tat Yau