In PySpark, using regexp_replace, how to replace a group with value from another column?

Question

I have a dataframe with two columns: filename and year. I want to replace the year value in filename with value from year column

Third column in the below table demonstrates the requirement:

+----------------------------+------+----------------------------+
| filename                   | year | reqd_filename              |
+----------------------------+------+----------------------------+
| blah_2020_v1_blah_blah.csv | 1975 | blah_1975_v1_blah_blah.csv |
+----------------------------+------+----------------------------+
| blah_2019_v1_blah_blah.csv | 1984 | blah_1984_v1_blah_blah.csv |
+----------------------------+------+----------------------------+

Code currently looks like below:

df = df.withColumn('filename', F.regexp_replace(F.col('filename',), '(blah_)(.*)(_v1.*)', <Nothing I put here works>))

In short, I want to replace the second group with year column from df

Cena · Accepted Answer · 2020-10-30T21:04:55.317

5

You can do this using expr.
I'm using ([0-9]{4}) as the regex pattern for detecting a year in filename.

from pyspark.sql.functions import expr

df.withColumn("reqd_filename",expr("regexp_replace(filename, \
        '([0-9]{4})', year)")).show()

+--------------------------+----+--------------------------+                    
|filename                  |year|reqd_filename             |
+--------------------------+----+--------------------------+
|blah_2020_v1_blah_blah.csv|1975|blah_1975_v1_blah_blah.csv|
|blah_2019_v1_blah_blah.csv|1984|blah_1984_v1_blah_blah.csv|
+--------------------------+----+--------------------------+

edited Oct 30 '20 at 21:04

answered Oct 30 '20 at 19:12

Cena

3,316
2
17
34

This is totally the correct answer. What a shame that `F.regexp_replace(...)` doesn't allow to use a Column as a third parameter just like the `F.expr("regexp_replace(...)")` version does. It just makes everything dirtier. – ciurlaro Feb 01 '21 at 13:40
Can you also obtain the second or third group this way? – Siete Oct 11 '22 at 10:44

In PySpark, using regexp_replace, how to replace a group with value from another column?

1 Answers1