I want to get the most frequently occurring string in a column, within a given window, and put this value in a new column. (I am using PySpark.)

This is what my table looks like.

window    label    value
123         a        54
123         a        45
123         a        21
123         b        99
123         b        78

I'm doing some aggregation, and at the moment I'm grouping by both window and label.

sqlContext.sql("SELECT avg(value) as avgValue FROM table GROUP BY window, label")

This returns the average where window = 123 and label = a, and the average where window = 123 and label = b.

What I am trying to do is order label by the most frequently occurring string, descending, so that in my SQL statement I can do SELECT first(label) as majLabel, avg(value) as avgValue FROM table GROUP BY window.
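For comparison, here is the intended result sketched in plain Python, without Spark, using the sample rows above. The names `rows` and `majority_label_and_avg` are invented for this illustration:

```python
from collections import Counter
from statistics import mean

# Sample rows from the question: (window, label, value)
rows = [
    (123, 'a', 54),
    (123, 'a', 45),
    (123, 'a', 21),
    (123, 'b', 99),
    (123, 'b', 78),
]

def majority_label_and_avg(rows):
    """For each window, return the most frequent label and the average value."""
    result = {}
    for win in {w for w, _, _ in rows}:
        group = [(lbl, val) for w, lbl, val in rows if w == win]
        counts = Counter(lbl for lbl, _ in group)
        maj_label = counts.most_common(1)[0][0]   # most frequent label in this window
        avg_value = mean(val for _, val in group)  # avg(value) over the whole window
        result[win] = (maj_label, avg_value)
    return result

print(majority_label_and_avg(rows))  # {123: ('a', 59.4)}
```

Label `a` occurs three times against two for `b`, so the majority label for window 123 is `a`, and the window-wide average is 59.4.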

I'm trying to do this in a window function but am just not quite getting there.

from pyspark.sql import Window

group = ["window"]

w = Window().partitionBy(*group)
other15
  • @zero323 this seems like something you would know how to approach? Given your expertise with Spark's window functions :D – other15 Jul 13 '16 at 12:50

1 Answer

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

df = spark.createDataFrame(
    [['123', 'a', '54'], ['123', 'a', '45'], ['123', 'a', '21'],
     ['123', 'b', '99'], ['123', 'b', '78']],
    ['window', 'label', 'value'])

Define the right WindowSpec:

win_spec = Window.partitionBy(['window', 'label']).orderBy(col('value').desc())

row_number() returns a sequential number starting at 1 within each ['window', 'label'] partition:

str_rank = df.withColumn('string_rank',row_number().over(win_spec))
str_rank.show()

This is how the df looks now:

+------+-----+-----+-----------+
|window|label|value|string_rank|
+------+-----+-----+-----------+
|   123|    a|   54|          1|
|   123|    a|   45|          2|
|   123|    a|   21|          3|
|   123|    b|   99|          1|
|   123|    b|   78|          2|
+------+-----+-----+-----------+

Select the rows where string_rank == 1 and drop the helper column:

str_rank.where(col('string_rank')==1).drop('string_rank').show()
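The rank-and-filter step above can be checked in plain Python, without a Spark session, using the same sample rows. The names `rows`, `keyfunc`, and `top_rows` are invented for this sketch:

```python
from itertools import groupby

# Same sample data as the answer's DataFrame: (window, label, value)
rows = [
    ('123', 'a', '54'),
    ('123', 'a', '45'),
    ('123', 'a', '21'),
    ('123', 'b', '99'),
    ('123', 'b', '78'),
]

# Partition by (window, label), order by value descending,
# then keep only the row with row_number == 1 in each partition.
keyfunc = lambda r: (r[0], r[1])
top_rows = []
for _, part in groupby(sorted(rows, key=keyfunc), key=keyfunc):
    ordered = sorted(part, key=lambda r: r[2], reverse=True)  # orderBy value desc
    top_rows.append(ordered[0])                               # string_rank == 1

print(top_rows)  # [('123', 'a', '54'), ('123', 'b', '99')]
```

This reproduces the two rows kept by the `string_rank == 1` filter: the highest value per (window, label) pair.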

Wong Tat Yau