pyspark dataframe groupby with aggregate unique values

Question

I am looking for the PySpark equivalent of pandas df.groupby('upc')['store'].unique(), where df is any pandas DataFrame.

The following code creates a sample DataFrame in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

# Sample rows: (upc, store, sale)
data2 = [
    ("36636", "M", 3000),
    ("40288", "M", 4000),
    ("42114", "M", 3000),
    ("39192", "F", 4000),
    ("39192", "F", 2000),
]

# Explicit schema; every column is nullable
schema = StructType([
    StructField("upc", StringType(), True),
    StructField("store", StringType(), True),
    StructField("sale", IntegerType(), True),
])

df = spark.createDataFrame(data=data2, schema=schema)

I know how to get a unique count per group in PySpark, but I need help getting the unique values themselves.

Do look at [`collect_set`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.collect_set.html) aggregation in pyspark. — Nithish, Dec 13 '21 at 14:08
Does this answer your question? [pyspark collect\_set or collect\_list with groupby](https://stackoverflow.com/questions/37580782/pyspark-collect-set-or-collect-list-with-groupby) — Rahul Kumar, Dec 13 '21 at 15:49
Yes, it is a similar question, but the author has phrased it differently. — Dileep Kumar, Dec 13 '21 at 16:44

score 0 · Answer 1 · answered Dec 13 '21 at 14:12

You can apply collect_set aggregation to collect unique values in a column. Note that collect_set ignores null values.

df.groupBy("upc").agg(F.collect_set("store")).show()

Output

+-----+------------------+
|  upc|collect_set(store)|
+-----+------------------+
|42114|               [M]|
|40288|               [M]|
|39192|               [F]|
|36636|               [M]|
+-----+------------------+
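
To make the semantics concrete, here is a minimal pure-Python sketch of what the groupBy plus collect_set combination computes for the question's data. It is only an illustration, not how Spark implements it: the helper name unique_stores_by_upc is hypothetical, and the values are sorted purely to make the result deterministic, since collect_set itself does not guarantee any order.

```python
from collections import defaultdict

# Same rows as the question's DataFrame: (upc, store, sale)
rows = [
    ("36636", "M", 3000),
    ("40288", "M", 4000),
    ("42114", "M", 3000),
    ("39192", "F", 4000),
    ("39192", "F", 2000),
]


def unique_stores_by_upc(rows):
    """Group rows by upc and keep each distinct, non-null store once.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(set)
    for upc, store, _sale in rows:
        stores = groups[upc]  # create the group entry even when store is null
        if store is not None:  # mirrors collect_set ignoring null values
            stores.add(store)
    # Sort each set so the result is deterministic
    return {upc: sorted(stores) for upc, stores in groups.items()}


stores_by_upc = unique_stores_by_upc(rows)
print(stores_by_upc)
```

The two rows for upc 39192 share the same store, so that group collapses to a single value, which matches the [F] shown in the output above.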

Ajay Chinni · Accepted Answer · 2021-12-13T14:18:29.250

0

You can use collect_set to get the unique values:

from pyspark.sql import functions as F
from pyspark.sql.functions import col
df_group = df.groupBy('upc').agg(F.collect_set(col('store')))

edited Dec 13 '21 at 14:18

answered Dec 13 '21 at 14:12

Ajay Chinni
