
I am familiar with doing this in a Pandas DataFrame, where I use "mode" and "groupby" to get the most frequent values per group, like below:

df3 = df5.groupby(['band']).apply(lambda x: x.mode())

However, I am having difficulty doing the same in PySpark.

I have a spark data frame as follows:

band         A3  A5  status
4G_band1800  12  18  TRUE
4G_band1800  12  18  FALSE
4G_band1800  10  18  TRUE
4G_band1800  12  12  TRUE
4g_band2300   6  24  FALSE
4g_band2300   6  22  FALSE
4g_band2300   6  24  FALSE
4g_band2300   3  24  TRUE


What I want is as follows:

band         A3  A5  status
4G_band1800  12  18  TRUE
4g_band2300   6  24  FALSE


I have tried various combinations but haven't gotten any reasonable output. Please suggest a way.

Python Spark

1 Answer


Without defining your own UDAF, you can define a mode function as a UDF and use it with collect_list, as follows:

import pyspark.sql.functions as F

# UDF that returns the most frequent element of a list.
# Note: @F.udf without an explicit return type defaults to StringType,
# so the aggregated columns come back as strings.
@F.udf
def mode(x):
    from collections import Counter
    return Counter(x).most_common(1)[0][0]

# Collect each column's values per group into a list, then take its mode.
cols = ['A3', 'A5', 'status']
agg_expr = [mode(F.collect_list(col)).alias(col) for col in cols]
df.groupBy('band').agg(*agg_expr).show()

+-----------+---+---+------+
|       band| A3| A5|status|
+-----------+---+---+------+
|4G_band1800| 12| 18|  true|
|4g_band2300|  6| 24| false|
+-----------+---+---+------+
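
If you prefer to avoid a Python UDF, a similar result can be computed with the DataFrame API alone. The sketch below (assuming the same df and column names; column_mode is a hypothetical helper) counts each value's occurrences per band, keeps the most frequent value per column via a window, and joins the per-column modes back together on band. Ties are broken arbitrarily.

from pyspark.sql import Window
import pyspark.sql.functions as F

def column_mode(df, group_col, value_col):
    # Count occurrences of each value within the group, then keep the
    # most frequent one (row_number 1 when ordered by descending count).
    w = Window.partitionBy(group_col).orderBy(F.desc('count'))
    return (df.groupBy(group_col, value_col).count()
              .withColumn('rn', F.row_number().over(w))
              .filter(F.col('rn') == 1)
              .select(group_col, value_col))

# Compute the mode of each column per band and join the results on 'band'.
result = None
for c in ['A3', 'A5', 'status']:
    m = column_mode(df, 'band', c)
    result = m if result is None else result.join(m, 'band')

result.show()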
Psidom