How to use pyspark to find whether a column contains one or more words in it's string sentence

Question

I have a dataset that looks like this

I am trying to flag or filter the rows which contain words from my list using pyspark

The reference list looks like ['house','tree']

So essentially it should return the first and third row. It should return the second row as trees is spelt with s on the end. I only want whole word matches.

My idea is to string split the string column, the loop across the reference list, is there a better way?

Can you specify the spark version you are using, you can do a split and use array_contains function for each value in the reference_list to achieve the same which will handle ordering of reference_list entries.you can try regex as well but in this case ordering of the reference list items will impact the result — HArdRe537, Oct 28 '20 at 09:00
have a look into this solution - Similar type of question https://stackoverflow.com/questions/64566730/remove-words-from-pyspark-dataframe-based-on-words-from-another-pyspark-datafram/64567386#64567386 — dsk, Oct 28 '20 at 10:33

dsk · Answer 1 · 2020-10-28T10:35:13.230

This can be a working solution for you - use higher order function array_contains() instead of loop through every item, however in order to implement the solution we need to streamline a little bit. such as need to make the string column as as an Array

Create the DataFrame Here

from pyspark.sql import functions as F
from pyspark.sql import types as T
df = spark.createDataFrame([(1,"This is a Horse"),(2,"Monkey Loves trees"),(3,"House has a tree"),(4,"The Ocean is Cold")],[ "col1","col2"])
df.show(truncate=False)

Output

+----+-----------------+
|col1|col2             |
+----+-----------------+
|1   |This is a Horse  |
|2   |Monkey Loves trees|
|3   |House has a tree |
|4   |The Ocean is Cold|
+----+-----------------+

Logic Here - convert the string column as ArrayType by using split()

df = df.withColumn("col2", F.split("col2", " "))
df = df.withColumn("array_filter", F.when(F.array_contains("col2", "This"), True).when(F.array_contains("col2", "tree"), True))
df = df.filter(F.col("array_filter") == True)
df.show(truncate=False)

Output

   +----+---------------------+------------+
|col1|col2                 |array_filter|
+----+---------------------+------------+
|1   |[This, is, a, Horse] |true        |
|3   |[House, has, a, tree]|true        |
+----+---------------------+------------+

How to use pyspark to find whether a column contains one or more words in it's string sentence

1 Answers1

Create the DataFrame Here

Output

Logic Here - convert the string column as ArrayType by using split()

Output

Linked