Pyspark, find substring as whole word(s)

Question

I would like to check whether the string in one column is contained in another column as a whole word. There are a few approaches, such as using contains as described here or using array_contains as described here.

The first approach fails in the following edge case:

+---------+-----------------------+
|candidate| sentence              |
+---------+-----------------------+
|  su     |We saw the survivors.  |
+---------+-----------------------+

Here su should not match: it should be found only as a separate word, not as a substring of another word (survivors) in the sentence column.

The second approach fails when the candidate is a compound word. An example is:

+----------------+------------------------+
|candidate       | sentence               |
+----------------+------------------------+
|  Roman emperor | He was a Roman emperor.|
+----------------+------------------------+

The second approach fails here because it turns the sentence column into an array of tokens, [He, was, a, Roman, emperor], and none of them is equal to Roman emperor.
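To make the two edge cases concrete, here is a minimal plain-Python sketch (no Spark involved) that mimics both approaches. The variable names are made up for illustration, and the `\w+` tokenizer is only an assumption about how the sentence would be split.

```python
import re

sentence_1, candidate_1 = "We saw the survivors.", "su"
sentence_2, candidate_2 = "He was a Roman emperor.", "Roman emperor"

# Plain substring test: "su" is found inside "survivors" (a false positive).
substring_hit = candidate_1 in sentence_1

# Token membership test: the two-word candidate never equals a single token
# (a false negative).
tokens = re.findall(r"\w+", sentence_2)
token_hit = candidate_2 in tokens

print(substring_hit, token_hit)
```

The substring test says yes where it should say no, and the token test says no where it should say yes.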

Is there any way to resolve this issue?

I would think that you would have to do 2 things. First split up the sentence into an array of words and second check if your candidate is in the split sentence using e.g. `isin()`. — k88, Jun 08 '22 at 19:51
@k88 that's the second link I mentioned above but it doesn't work because "Roman emperor" are two words. — A.M., Jun 08 '22 at 20:02
Right right. Depending on the complexity of your `candidate`, this ventures into NLP. If so, I would recommend looking into similarity computations, e.g. https://neuml.github.io/txtai/pipeline/text/similarity/ — k88, Jun 08 '22 at 20:47

score 3 · Accepted Answer · answered Jun 08 '22 at 20:21

This probably still has edge cases, but I hope it gives you some ideas. I would use regexp_extract to match the candidate against the sentence.

First, I convert the candidate to a regex (i.e., replace each space with \s), then use regexp_extract with word boundaries (\b) around it. An empty match means the candidate was not found as a whole word.

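from pyspark.sql import functions as F
# Spaces become \s, then \b word boundaries wrap the pattern; an empty match means "not found".
df = (df.withColumn('regex', F.regexp_replace(F.col('candidate'), ' ', r'\\s'))
      .withColumn('match', F.expr(r"regexp_extract(sentence, concat('\\b', regex, '\\b'), 0)")))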

Result

+-------------+-----------------------+--------------+-------------+
|    candidate|               sentence|         regex|        match|
+-------------+-----------------------+--------------+-------------+
|           su|  We saw the survivors.|            su|             |
|Roman emperor|He was a Roman emperor.|Roman\semperor|Roman emperor|
+-------------+-----------------------+--------------+-------------+
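
One of the edge cases hinted at above is a candidate that contains regex metacharacters or starts or ends with punctuation. The sketch below shows the same matching idea in plain Python rather than Spark, so it is an illustration and not the code from this answer. The helper name `whole_word_match` is hypothetical. It swaps `\b` for lookarounds and escapes the candidate with `re.escape`. Spark uses Java regex, so the escaping there would have to follow Java's rules.

```python
import re


def whole_word_match(candidate: str, sentence: str) -> str:
    """Return the matched text if candidate occurs as whole word(s), else ''."""
    words = candidate.split()
    if not words:
        return ""
    # Escape each word so characters like '.' or '+' are matched literally,
    # and allow any run of whitespace between consecutive words.
    body = r"\s+".join(re.escape(word) for word in words)
    # Lookarounds instead of \b: the match must not touch a word character
    # on either side, which also works when the candidate ends in punctuation.
    pattern = r"(?<!\w)" + body + r"(?!\w)"
    found = re.search(pattern, sentence)
    return found.group(0) if found else ""


print(repr(whole_word_match("su", "We saw the survivors.")))
print(repr(whole_word_match("Roman emperor", "He was a Roman emperor.")))
```

For the two rows in the question this gives an empty string for su and Roman emperor for the second row, the same as the match column above.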

Thanks for this, it did work. On a larger set, this is slow, however. — pnv, Aug 23 '23 at 19:43
@pnv I think that is something you need to optimize on the resource side: number of executors, their size, CPU, etc. — Emma, Aug 24 '23 at 15:39
