pyspark dataframe: remove duplicates in an array column

Question

I would like to remove duplicated words in a column of a PySpark DataFrame.

My attempt is based on "Remove duplicates from PySpark array column".

My Spark:

  2.4.5

Py3 code:

  import pyspark.sql.functions as F

  test_df = spark.createDataFrame([("I like this Book and this book be DOWNLOADED on line",)], ["text"])
  t3 = test_df.withColumn("text", F.array("text"))  # wrapped in an array because the column in the original large df is array type

  t4 = t3.withColumn('text', F.expr("transform(text, x -> lower(x))"))
  t5 = t4.withColumn('text', F.array_distinct("text"))
  t5.show(1, 120)

but got

 +------------------------------------------------------+
 |                                                  text|
 +------------------------------------------------------+
 |[i like this book and this book be downloaded on line]|
 +------------------------------------------------------+

I need to remove

 book and this

It seems that array_distinct cannot filter them out?

thanks

Do have a look into the given link. It might be helpful: https://stackoverflow.com/questions/47316783/python-dataframe-remove-duplicate-words-in-the-same-cell-within-a-column-in-pyt — Muhammad Hamza Sabir, Sep 15 '20 at 05:11
`and` is not duplicated anywhere in the string. So based on what do you want to remove it? Or do you mean `book` and `this`? Can you show your desired final result? — kfkhalili, Sep 15 '20 at 07:29
It won't filter out anything because the array holds a single string rather than multiple strings, so array_distinct finds only one element in it. I assume you need to remove duplicate words from the string and not from the array of strings. Is this correct? — Frosty, Sep 15 '20 at 08:58

A.B · Answer 1 · 2020-10-01T23:17:12.963

You can use the Spark SQL functions lcase, split, array_distinct and array_join, called through F.expr.

For example, F.expr("array_join(array_distinct(split(lcase(text),' ')),' ')")

Here is working code:

import pyspark.sql.functions as F

# df is assumed to have a string column named "text"
(df
 .withColumn("text_new",
    F.expr("array_join(array_distinct(split(lcase(text), ' ')), ' ')"))
 .show(truncate=False))

Explanation:

Here, you first convert everything to lower case with lcase(text), then split the string on single spaces with split(text, ' '), which produces

[i, like, this, book, and, this, book, be, downloaded, on, line]

then you pass this to array_distinct, which produces

[i, like, this, book, and, be, downloaded, on, line]

and finally, you join the words with a single space using array_join

i like this book and be downloaded on line
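
To check the expected result without a Spark session, the same four steps can be mirrored in plain Python. This is only a sketch: dedupe_words is a hypothetical helper name, and dict.fromkeys stands in for array_distinct on the assumption that the first occurrence of each word is kept in order, as in the output shown above.

```python
def dedupe_words(text: str) -> str:
    """Plain-Python mirror of array_join(array_distinct(split(lcase(text), ' ')), ' ')."""
    lowered = text.lower()                # corresponds to lcase(text)
    words = lowered.split(" ")            # corresponds to split(..., ' ')
    unique = list(dict.fromkeys(words))   # stand-in for array_distinct; keeps first occurrences
    return " ".join(unique)               # corresponds to array_join(..., ' ')


print(dedupe_words("I like this Book and this book be DOWNLOADED on line"))
# i like this book and be downloaded on line
```

Lower-casing comes first so that "Book" and "book" count as the same word; swapping the first two steps would give the same result, but deduplicating before lower-casing would not.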
