
I am familiar with doing this in a Pandas DataFrame, where I use "mode" and "groupby" to get the most frequent values per group, like below:

df3 = df5.groupby(['band']).apply(lambda x: x.mode())

However, I am having difficulty doing the same in PySpark.

I have a spark data frame as follows:

band         A3  A5  status
4G_band1800  12  18  TRUE
4G_band1800  12  18  FALSE
4G_band1800  10  18  TRUE
4G_band1800  12  12  TRUE
4g_band2300   6  24  FALSE
4g_band2300   6  22  FALSE
4g_band2300   6  24  FALSE
4g_band2300   3  24  TRUE


What I want is as follows:

band         A3  A5  status
4G_band1800  12  18  TRUE
4g_band2300   6  24  FALSE


I have tried various combinations but haven't gotten any reasonable output. Please suggest a way.

Python Spark

1 Answer


Without defining your own UDAF, you can define a mode function as a UDF and use it with collect_list, as follows:

import pyspark.sql.functions as F

# UDF that returns the most frequent element of a list.
# Note: @F.udf without an explicit return type defaults to StringType,
# so the aggregated columns come back as strings.
@F.udf
def mode(x):
    from collections import Counter
    return Counter(x).most_common(1)[0][0]

# Collect each column's values per group into a list, then take its mode.
cols = ['A3', 'A5', 'status']
agg_expr = [mode(F.collect_list(col)).alias(col) for col in cols]
df.groupBy('band').agg(*agg_expr).show()

+-----------+---+---+------+
|       band| A3| A5|status|
+-----------+---+---+------+
|4G_band1800| 12| 18|  true|
|4g_band2300|  6| 24| false|
+-----------+---+---+------+
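
If you prefer to avoid a Python UDF, a similar result can be computed with the DataFrame API alone. The sketch below (assuming the same df and column names; column_mode is a hypothetical helper) counts each value's occurrences per band, keeps the most frequent value per column via a window, and joins the per-column modes back together on band. Ties are broken arbitrarily.

from pyspark.sql import Window
import pyspark.sql.functions as F

def column_mode(df, group_col, value_col):
    # Count occurrences of each value within the group, then keep the
    # most frequent one (row_number 1 when ordered by descending count).
    w = Window.partitionBy(group_col).orderBy(F.desc('count'))
    return (df.groupBy(group_col, value_col).count()
              .withColumn('rn', F.row_number().over(w))
              .filter(F.col('rn') == 1)
              .select(group_col, value_col))

# Compute the mode of each column per band and join the results on 'band'.
result = None
for c in ['A3', 'A5', 'status']:
    m = column_mode(df, 'band', c)
    result = m if result is None else result.join(m, 'band')

result.show()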
Psidom