I would like to check whether the string in one column is contained in another column as a whole word. There are a few approaches, such as using contains as described here, or using array_contains as described here.
The first approach fails in the following edge case:
+---------+-----------------------+
|candidate|sentence               |
+---------+-----------------------+
|su       |We saw the survivors.  |
+---------+-----------------------+
Here su should match only as a separate word, not as a mere substring of the sentence column. A substring check wrongly reports a match because su occurs inside survivors.
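To make the failure concrete, here is a minimal plain-Python sketch of the same substring semantics. It is only an analogy for a column-level contains check, not Spark code, and the helper name contains_substring is hypothetical.

```python
def contains_substring(sentence: str, candidate: str) -> bool:
    """Plain substring test: true whenever candidate occurs anywhere in sentence.

    Hypothetical helper; it only mimics the semantics of a substring-style
    `contains` check and is not part of any Spark API.
    """
    return candidate in sentence


sentence = "We saw the survivors."
candidate = "su"

# "su" occurs inside "survivors", so the substring test reports a match
# even though "su" never appears as a word of its own.
is_match = contains_substring(sentence, candidate)
print(is_match)
```

The check returns a match for this row, which is exactly the false positive described above.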
The second approach fails when the candidate is a multi-word phrase. For example:
+----------------+------------------------+
|candidate       |sentence                |
+----------------+------------------------+
|Roman emperor   |He was a Roman emperor. |
+----------------+------------------------+
The second approach fails here because it turns the sentence column into an array of tokens, [He, was, a, Roman, emperor], and none of those tokens is equal to Roman emperor.
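The token-based failure can be reproduced the same way in plain Python. This is a sketch under assumptions: the tokenizer below (a hypothetical tokenize helper that splits on runs of word characters) stands in for whatever splitting is applied before array_contains, and it is chosen only because it yields the token list shown above.

```python
import re
from typing import List


def tokenize(sentence: str) -> List[str]:
    """Split a sentence into word tokens, dropping punctuation.

    Hypothetical stand-in for the tokenization step; the real splitting
    rule used before `array_contains` may differ.
    """
    return re.findall(r"\w+", sentence)


tokens = tokenize("He was a Roman emperor.")
print(tokens)

# Membership compares the candidate against each single token, so a
# two-word candidate can never be equal to any one of them.
found = "Roman emperor" in tokens
print(found)
```

Each individual word (Roman, emperor) is present in the token list, but the two-word candidate is not, so the membership test misses it.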
Is there any way to resolve this issue?