49

I have the following Python/pandas command:

df.groupby('Column_Name').agg(lambda x: x.value_counts().max())

where I get the (maximum) value counts for ALL columns in a DataFrameGroupBy object.

How do I do this action in PySpark?

TSAR
  • The task I am asking about is very simple. I want to get the value counts (the highest distinct count) for all columns in a grouped DataFrame. This is easily done in Pandas with the value_counts() method. – TSAR Jun 27 '18 at 13:40
  • Here is my DF:
    >>> schemaTrans.show()
    +----+----+------+-----+----+----+
    |COL1|COL2|  COL3| COL4|COL5|  ID|
    +----+----+------+-----+----+----+
    | 123| 456|ABC123|  XYZ| 525|ID01|
    | 123| 456|ABC123|  XYZ| 634|ID01|
    | 123| 456|ABC123|  XYZ| 802|ID01|
    | 456| 123|  BC01|K_L_M| 213|ID01|
    | 456| 123|  BC01|K_L_M| 401|ID01|
    | 456| 123|  BC01|P_Q_M| 213|ID01|
    | 123| 456|XYZ012|  ABC| 117|ID02|
    | 123| 456|XYZ012|  ABE| 117|ID02|
    | 456| 123| QPR12|S_T_U| 204|ID02|
    | 456| 123| QPR12|S_T_X| 415|ID02|
    +----+----+------+-----+----+----+
    – TSAR Jun 27 '18 at 14:20
  • from pyspark.sql.functions import count
    exprs = {x: "count" for x in schemaTrans.columns}
    schemaTrans.groupBy("ID").agg(exprs).show(5)
    +----+---------+-----------+-----------+-----------+-----------+-----------+
    |  ID|count(ID)|count(COL4)|count(COL2)|count(COL3)|count(COL1)|count(COL5)|
    +----+---------+-----------+-----------+-----------+-----------+-----------+
    |ID01|        6|          6|          6|          6|          6|          6|
    |ID02|        4|          4|          4|          4|          4|          4|
    +----+---------+-----------+-----------+-----------+-----------+-----------+
    – TSAR Jun 27 '18 at 14:23
  • exprs = [countDistinct(x) for x in schemaTrans.columns]
    schemaTrans.groupBy("ID").agg(*exprs).show(5)
    +----+---------------+---------------+---------------+---------------+---------------+-------------+
    |  ID|(DISTINCT COL1)|(DISTINCT COL2)|(DISTINCT COL3)|(DISTINCT COL4)|(DISTINCT COL5)|(DISTINCT ID)|
    +----+---------------+---------------+---------------+---------------+---------------+-------------+
    |ID01|              2|              2|              2|              3|              5|            1|
    |ID02|              2|              2|              2|              4|              3|            1|
    +----+---------------+---------------+---------------+---------------+---------------+-------------+
    – TSAR Jun 27 '18 at 14:28
  • But I would like to have:
    +----+---------+---------+---------+---------+---------+-------+
    |  ID|(VL COL1)|(VL COL2)|(VL COL3)|(VL COL4)|(VL COL5)|(VL ID)|
    +----+---------+---------+---------+---------+---------+-------+
    |ID01|        3|        3|        3|        3|        2|      1|
    |ID02|        2|        2|        2|        2|        2|      1|
    +----+---------+---------+---------+---------+---------+-------+
    – TSAR Jun 27 '18 at 14:32
  • 2
    Please do not add these as comments. [Edit] your question and put it there. Please also read [how do I format my code blocks](https://meta.stackexchange.com/questions/22186/how-do-i-format-my-code-blocks). Also check out [how to make good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). – pault Jun 27 '18 at 14:41
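
Pulling the comments above together: the goal is, for every column, the largest per-value count within each ID group, i.e. pandas' value_counts().max() per group. A minimal sketch of one way to do that in PySpark, assuming the schemaTrans DataFrame shown above (the "VL" aliases are only illustrative):

from pyspark.sql import functions as F

# For each column, count how often each value occurs within an ID group,
# then keep the largest of those counts -- the analogue of
# pandas' value_counts().max() inside a groupby.
result = None
for name in [c for c in schemaTrans.columns if c != "ID"]:
    per_col = (schemaTrans.groupBy("ID", name)
                          .count()
                          .groupBy("ID")
                          .agg(F.max("count").alias("VL " + name)))
    result = per_col if result is None else result.join(per_col, on="ID")

result.show()

Each intermediate DataFrame has only one row per ID, so the chained joins stay small.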

5 Answers

42

It's more or less the same:

spark_df.groupBy('column_name').count().orderBy('count')

In the groupBy you can pass multiple columns, delimited by commas.

For example groupBy('column_1', 'column_2')
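
A complete statement, chained with .show() to actually display the result, might look like this (the column names here are just placeholders):

spark_df.groupBy('column_1', 'column_2').count().orderBy('count').show()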

Tanjin
  • Hi Tanjin, thank you for your reply! I am not getting the same result. I have been doing the following: (Action-1): from pyspark.sql.functions import count; exprs = {x: "count" for x in df.columns}; df.groupBy("ID").agg(exprs).show(5). This works, but I am getting the total record count for each group, which is NOT what I want. (Action-2): from pyspark.sql.functions import countDistinct; exprs = [countDistinct(x) for x in df.columns]; df.groupBy("ID").agg(*exprs).show(5). This breaks!! It errors out as follows: ERROR client.TransportResponseHandler: – TSAR Jun 27 '18 at 13:34
  • 3
    The missing `.show()` that you need to add onto the end of that line to actually see the results might be confusing to beginners. – rer Mar 15 '21 at 15:52
  • 4
    To match the behavior in Pandas you want to return count by descending order: `spark_df.groupBy('column_name').count().orderBy(col('count').desc()).show()` – Bernard Apr 28 '21 at 13:44
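
A note on Bernard's comment above: that variant also needs col imported. A minimal sketch:

from pyspark.sql.functions import col

spark_df.groupBy('column_name').count().orderBy(col('count').desc()).show()
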
14

Try this when you want to control the order:

data.groupBy('col_name').count().orderBy('count', ascending=False).show()
s510
3

Try this:

spark_df.groupBy('column_name').count().show()
Kierk
0
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, desc

spark = SparkSession.builder.appName('whatever_name').getOrCreate()
spark_df = spark.read.option('header', True).csv(your_file)

# Count occurrences of each value in Column_Name, most frequent first
value_counts = (spark_df.groupBy('Column_Name')
                        .agg(count('Column_Name').alias('counts'))
                        .orderBy(desc('counts')))
value_counts.show()

But Spark is much slower than pandas' value_counts() on a single machine.
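
For comparison, the pandas side of that claim is roughly a one-liner (assuming the same CSV file and column name):

import pandas as pd

pd.read_csv(your_file)['Column_Name'].value_counts()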

0

df.groupBy('column_name').count().orderBy('count').show()

  • 2
    Please read "[answer]". It helps more if you supply an explanation why this is the preferred solution and explain how it works. We want to educate, not just provide code. – the Tin Man Mar 11 '22 at 05:56