
How can I group by and aggregate over a list of columns? This is what I tried:

from pyspark.sql import functions as F
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')
spark = SparkSession(sc)

# [col list] is a placeholder for a runtime-supplied list of column names
grouped = df.groupby([col list]).agg(F.count([col list]))

I've read the similar questions on Stack Overflow but could not find an exact answer.

Even if I pass a single column:

grouped = dfn.groupby('col name').agg(F.count('col name'))

I get:

File "py4j\java_collections.py", line 500, in convert
    for element in object:
TypeError: 'type' object is not iterable

Related question: pyspark Column is not iterable

I don't know the column names beforehand and need to provide a list as input to the groupBy and agg functions.

1 Answer

You can simply use the .count() method on the GroupedData object.

Let's prepare some data (I assume you have a SparkSession object available under the spark variable):

>>> import pandas as pd
>>>
>>> pdf = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
>>> df = spark.createDataFrame(pdf)
>>> df.show(5)

+------------+-----------+------------+-----------+-------+
|sepal_length|sepal_width|petal_length|petal_width|species|
+------------+-----------+------------+-----------+-------+
|         5.1|        3.5|         1.4|        0.2| setosa|
|         4.9|        3.0|         1.4|        0.2| setosa|
|         4.7|        3.2|         1.3|        0.2| setosa|
|         4.6|        3.1|         1.5|        0.2| setosa|
|         5.0|        3.6|         1.4|        0.2| setosa|
+------------+-----------+------------+-----------+-------+
only showing top 5 rows

Then simply call the groupBy(*cols) method with the desired column(s) of the DataFrame.

>>> grouped = df.groupBy(['petal_width', 'species']).count()
>>> grouped.show(5)

+-----------+----------+-----+
|petal_width|   species|count|
+-----------+----------+-----+
|        1.7| virginica|    1|
|        2.2| virginica|    3|
|        1.8| virginica|   11|
|        1.9| virginica|    5|
|        1.5|versicolor|   10|
+-----------+----------+-----+
only showing top 5 rows
lukaszKielar