
I am trying to extract regex patterns from a column using PySpark. I have a DataFrame containing the regex patterns and a second table containing the strings I'd like to match.

from pyspark.sql import SparkSession

sql = SparkSession.builder.getOrCreate()

columns = ['id', 'text']
vals = [
    (1, 'here is a Match1'),
    (2, 'Do not match'),
    (3, 'Match2 is another example'),
    (4, 'Do not match'),
    (5, 'here is a Match1')
]

df_to_extract = sql.createDataFrame(vals, columns)


columns = ['id', 'Regex', 'Replacement']
vals = [
    (1, 'Match1', 'Found1'),
    (2, 'Match2', 'Found2'),
]

df_regex = sql.createDataFrame(vals, columns)

I'd like to match each 'Regex' value against the 'text' column of 'df_to_extract', producing a table of each matching id together with the 'Replacement' that corresponds to the matched 'Regex'. For example:

+---+------------+
| id| replacement|
+---+------------+
|  1|      Found1|
|  3|      Found2|
|  5|      Found1|
+---+------------+

Thanks!

pault
shbfy

1 Answer


One way is to use pyspark.sql.functions.expr, which allows you to use a column value as a parameter in your join condition.

For example:

from pyspark.sql.functions import expr
df_to_extract.alias("e")\
    .join(
        df_regex.alias("r"), 
        on=expr(r"e.text LIKE concat('%', r.Regex, '%')"),
        how="inner"
    )\
    .select("e.id", "r.Replacement")\
    .show()
#+---+-----------+
#| id|Replacement|
#+---+-----------+
#|  1|     Found1|
#|  3|     Found2|
#|  5|     Found1|
#+---+-----------+

Here I used the SQL expression:

e.text LIKE concat('%', r.Regex, '%')

This joins all rows where the text column contains the Regex column's value as a literal substring, with the % wildcards capturing anything before and after.

pault
  • Thank you! Would this work if there are multiple matches? – shbfy Jan 22 '19 at 16:27
  • 1
    @avocet this will return one row per match. So one row in `df_to_extract` can match to multiple rows in `df_regex`. In other words, you can have multiple rows with the same `id` in the output but each will have a different `Replacement`. – pault Jan 22 '19 at 16:33