1

I have a dataset that looks like this

enter image description here

I am trying to flag or filter the rows which contain words from my list using pyspark

The reference list looks like ['house','tree']

So essentially it should return the first and third row. It should return the second row as trees is spelt with s on the end. I only want whole word matches.

My idea is to string split the string column, the loop across the reference list, is there a better way?

shecode
  • 1,716
  • 6
  • 32
  • 50
  • Can you specify the spark version you are using, you can do a split and use array_contains function for each value in the reference_list to achieve the same which will handle ordering of reference_list entries.you can try regex as well but in this case ordering of the reference list items will impact the result – HArdRe537 Oct 28 '20 at 09:00
  • have a look into this solution - Similar type of question https://stackoverflow.com/questions/64566730/remove-words-from-pyspark-dataframe-based-on-words-from-another-pyspark-datafram/64567386#64567386 – dsk Oct 28 '20 at 10:33

1 Answers1

2

This can be a working solution for you - use higher order function array_contains() instead of loop through every item, however in order to implement the solution we need to streamline a little bit. such as need to make the string column as as an Array

Create the DataFrame Here

from pyspark.sql import functions as F
from pyspark.sql import types as T
df = spark.createDataFrame([(1,"This is a Horse"),(2,"Monkey Loves trees"),(3,"House has a tree"),(4,"The Ocean is Cold")],[ "col1","col2"])
df.show(truncate=False)

Output

+----+-----------------+
|col1|col2             |
+----+-----------------+
|1   |This is a Horse  |
|2   |Monkey Loves trees|
|3   |House has a tree |
|4   |The Ocean is Cold|
+----+-----------------+

Logic Here - convert the string column as ArrayType by using split()

df = df.withColumn("col2", F.split("col2", " "))
df = df.withColumn("array_filter", F.when(F.array_contains("col2", "This"), True).when(F.array_contains("col2", "tree"), True))
df = df.filter(F.col("array_filter") == True)
df.show(truncate=False)

Output

   +----+---------------------+------------+
|col1|col2                 |array_filter|
+----+---------------------+------------+
|1   |[This, is, a, Horse] |true        |
|3   |[House, has, a, tree]|true        |
+----+---------------------+------------+
dsk
  • 1,863
  • 2
  • 10
  • 13