I would like to remove duplicated words in a column of a PySpark DataFrame.
This is based on the question "Remove duplicates from PySpark array column".
My Spark version is 2.4.5.
Python 3 code:
from pyspark.sql import functions as F  # "spark" is the existing SparkSession (e.g. from the pyspark shell)

test_df = spark.createDataFrame([("I like this Book and this book be DOWNLOADED on line",)], ["text"])
# Wrapped in an array because the column in the original, large DataFrame is array type.
t3 = test_df.withColumn("text", F.array("text"))
t4 = t3.withColumn("text", F.expr("transform(text, x -> lower(x))"))
t5 = t4.withColumn("text", F.array_distinct("text"))
t5.show(1, 120)
but I got:
+------------------------------------------------------+
|                                                  text|
+------------------------------------------------------+
|[i like this book and this book be downloaded on line]|
+------------------------------------------------------+
I need to remove the repeated words:
book and this
It seems that "array_distinct" cannot filter them out?
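One possible reading of the output, sketched in plain Python rather than Spark: F.array("text") builds an array with a single element (the whole sentence), so a distinct over that array has nothing to remove. The helper name distinct_keep_order is made up for illustration, and the sketch assumes a word-level distinct that keeps the first occurrence of each word, which would still keep "and". In Spark 2.4 the analogous built-ins would presumably be split, array_distinct and array_join, but that part is not run here.

```python
# Plain-Python model of the behaviour; no Spark session is needed.
def distinct_keep_order(items):
    """Hypothetical helper: drop repeated elements, keeping the first occurrence of each."""
    return list(dict.fromkeys(items))


sentence = "I like this Book and this book be DOWNLOADED on line"

# What the posted code builds: an array with ONE element, the whole lowercased sentence.
one_element = [sentence.lower()]

# What a word-level distinct would need: one array element per word.
words = sentence.lower().split(" ")

# The single-element array comes back unchanged.
print(distinct_keep_order(one_element))

# The word array loses the second "this" and the second "book".
print(" ".join(distinct_keep_order(words)))
```

Under these assumptions the first print shows the sentence unchanged inside a one-element list, and the second prints "i like this book and be downloaded on line".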
Thanks.