
I have a PySpark DataFrame that contains an ArrayType(StringType()) column. This column contains duplicate strings inside the array, which I need to remove. For example, one row entry could look like `[milk, bread, milk, toast]`. Let's say my DataFrame is named `df` and my column is named `arraycol`. I need something like:

df = df.withColumn("arraycol_without_dupes", F.remove_dupes_from_array("arraycol"))

My intuition was that there exists a simple solution to this, but after browsing Stack Overflow for 15 minutes I didn't find anything better than exploding the column, removing duplicates on the complete DataFrame, then grouping again. There has got to be a simpler way that I just didn't think of, right?

I am using Spark version 2.4.0

  • Why can't you do: `df = df.dropDuplicates(subset = ["arraycol"])`? – YOLO Jan 14 '19 at 16:49
  • @YOLO: The duplicates are inside the array in a single row... I will reformulate my question to be more precise. – Thomas Jan 14 '19 at 16:52

1 Answer


For pyspark version 2.4+, you can use pyspark.sql.functions.array_distinct:

from pyspark.sql.functions import array_distinct
df = df.withColumn("arraycol_without_dupes", array_distinct("arraycol"))

For older versions, you can do this with the API functions using explode + groupBy and collect_set (a sketch of that route follows after the udf example), but a udf is probably more efficient here:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Note: set() does not preserve element order, so the result may be reordered
remove_dupes_from_array = udf(lambda row: list(set(row)), ArrayType(StringType()))
df = df.withColumn("arraycol_without_dupes", remove_dupes_from_array("arraycol"))
  • Thank you, this is exactly what I was looking for! I will see whether I can upgrade my cluster to 2.4.0. – Thomas Jan 14 '19 at 16:51
  • I was able to upgrade my cluster and can confirm that this does exactly what I need. – Thomas Jan 15 '19 at 08:41