49

I have the following Python/pandas command:

df.groupby('Column_Name').agg(lambda x: x.value_counts().max())

where I get the (maximum) value counts for ALL columns in a DataFrameGroupBy object.

How do I do this action in PySpark?

TSAR
  • The task I am asking about is very simple. I want to get the value counts (the highest distinct count) for all columns in a grouped DataFrame. This is easily done in Pandas with the value_counts() method. – TSAR Jun 27 '18 at 13:40
  • Here is my DF:
    >>> schemaTrans.show()
    +----+----+------+-----+----+----+
    |COL1|COL2|  COL3| COL4|COL5|  ID|
    +----+----+------+-----+----+----+
    | 123| 456|ABC123|  XYZ| 525|ID01|
    | 123| 456|ABC123|  XYZ| 634|ID01|
    | 123| 456|ABC123|  XYZ| 802|ID01|
    | 456| 123|  BC01|K_L_M| 213|ID01|
    | 456| 123|  BC01|K_L_M| 401|ID01|
    | 456| 123|  BC01|P_Q_M| 213|ID01|
    | 123| 456|XYZ012|  ABC| 117|ID02|
    | 123| 456|XYZ012|  ABE| 117|ID02|
    | 456| 123| QPR12|S_T_U| 204|ID02|
    | 456| 123| QPR12|S_T_X| 415|ID02|
    +----+----+------+-----+----+----+
    – TSAR Jun 27 '18 at 14:20
  • from pyspark.sql.functions import count
    exprs = {x: "count" for x in schemaTrans.columns}
    schemaTrans.groupBy("ID").agg(exprs).show(5)
    +----+---------+-----------+-----------+-----------+-----------+-----------+
    |  ID|count(ID)|count(COL4)|count(COL2)|count(COL3)|count(COL1)|count(COL5)|
    +----+---------+-----------+-----------+-----------+-----------+-----------+
    |ID01|        6|          6|          6|          6|          6|          6|
    |ID02|        4|          4|          4|          4|          4|          4|
    +----+---------+-----------+-----------+-----------+-----------+-----------+
    – TSAR Jun 27 '18 at 14:23
  • exprs = [countDistinct(x) for x in schemaTrans.columns]
    schemaTrans.groupBy("ID").agg(*exprs).show(5)
    +----+---------------+---------------+---------------+---------------+---------------+-------------+
    |  ID|(DISTINCT COL1)|(DISTINCT COL2)|(DISTINCT COL3)|(DISTINCT COL4)|(DISTINCT COL5)|(DISTINCT ID)|
    +----+---------------+---------------+---------------+---------------+---------------+-------------+
    |ID01|              2|              2|              2|              3|              5|            1|
    |ID02|              2|              2|              2|              4|              3|            1|
    +----+---------------+---------------+---------------+---------------+---------------+-------------+
    – TSAR Jun 27 '18 at 14:28
  • But I would like to have:
    +----+---------+---------+---------+---------+---------+-------+
    |  ID|(VL COL1)|(VL COL2)|(VL COL3)|(VL COL4)|(VL COL5)|(VL ID)|
    +----+---------+---------+---------+---------+---------+-------+
    |ID01|        3|        3|        3|        3|        2|      1|
    |ID02|        2|        2|        2|        2|        2|      1|
    +----+---------+---------+---------+---------+---------+-------+
    – TSAR Jun 27 '18 at 14:32
  • 2
    Please do not add these as comments. [Edit] your question and put it there. Please also read [how do I format my code blocks](https://meta.stackexchange.com/questions/22186/how-do-i-format-my-code-blocks). Also check out [how to make good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). – pault Jun 27 '18 at 14:41
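
Pulling the comments above together: the goal is, for every column, the largest per-value count within each ID group, i.e. pandas' value_counts().max() per group. A minimal sketch of one way to do that in PySpark, assuming the schemaTrans DataFrame shown above (the "VL" aliases are only illustrative):

from pyspark.sql import functions as F

# For each column, count how often each value occurs within an ID group,
# then keep the largest of those counts -- the analogue of
# pandas' value_counts().max() inside a groupby.
result = None
for name in [c for c in schemaTrans.columns if c != "ID"]:
    per_col = (schemaTrans.groupBy("ID", name)
                          .count()
                          .groupBy("ID")
                          .agg(F.max("count").alias("VL " + name)))
    result = per_col if result is None else result.join(per_col, on="ID")

result.show()

Each intermediate DataFrame has only one row per ID, so the chained joins stay small.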

5 Answers

42

It's more or less the same:

spark_df.groupBy('column_name').count().orderBy('count')

In the groupBy you can pass multiple columns, delimited by commas.

For example groupBy('column_1', 'column_2')
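
A complete statement, chained with .show() to actually display the result, might look like this (the column names here are just placeholders):

spark_df.groupBy('column_1', 'column_2').count().orderBy('count').show()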

Tanjin
  • Hi Tanjin, thank you for your reply! I am not getting the same result. I have been doing the following: (Action-1): from pyspark.sql.functions import count; exprs = {x: "count" for x in df.columns}; df.groupBy("ID").agg(exprs).show(5). This works, but I am getting the total record count for each group, which is NOT what I want. (Action-2): from pyspark.sql.functions import countDistinct; exprs = [countDistinct(x) for x in df.columns]; df.groupBy("ID").agg(*exprs).show(5). This breaks!! It errors out as follows: ERROR client.TransportResponseHandler: – TSAR Jun 27 '18 at 13:34
  • 3
    The missing `.show()` that you need to add onto the end of that line to actually see the results might be confusing to beginners. – rer Mar 15 '21 at 15:52
  • 4
    To match the behavior in Pandas you want to return count by descending order: `spark_df.groupBy('column_name').count().orderBy(col('count').desc()).show()` – Bernard Apr 28 '21 at 13:44
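
A note on Bernard's comment above: that variant also needs col imported. A minimal sketch:

from pyspark.sql.functions import col

spark_df.groupBy('column_name').count().orderBy(col('count').desc()).show()
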
14

Try this when you want to control the order:

data.groupBy('col_name').count().orderBy('count', ascending=False).show()
s510
3

Try this:

spark_df.groupBy('column_name').count().show()
Kierk
0
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, desc

spark = SparkSession.builder.appName('whatever_name').getOrCreate()
spark_df = spark.read.option('header', True).csv(your_file)

# Count occurrences of each value in Column_Name, most frequent first
value_counts = (spark_df.groupBy('Column_Name')
                        .agg(count('Column_Name').alias('counts'))
                        .orderBy(desc('counts')))
value_counts.show()

But Spark is much slower than pandas' value_counts() on a single machine.
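
For comparison, the pandas side of that claim is roughly a one-liner (assuming the same CSV file and column name):

import pandas as pd

pd.read_csv(your_file)['Column_Name'].value_counts()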

0

df.groupBy('column_name').count().orderBy('count').show()

  • 2
    Please read "[answer]". It helps more if you supply an explanation why this is the preferred solution and explain how it works. We want to educate, not just provide code. – the Tin Man Mar 11 '22 at 05:56