3

I have a dataframe with a text column and a name column. I would like to check if the name exists in the text column and if it does to replace it with some value. I was hoping that the following would work:

df = df.withColumn("new_text",regex_replace(col("text),col("name"),"NAME"))

but Column is not iterable so it does not work. Do I have to write a udf to do that? How would that look like?

Eyal S.
  • 1,141
  • 4
  • 17
  • 29
  • Possible duplicate of https://stackoverflow.com/questions/45615621/spark-column-string-replace-when-present-in-other-column-row – giser_yugang May 30 '19 at 09:21
  • Possible duplicate of [Using a column value as a parameter to a spark DataFrame function](https://stackoverflow.com/questions/51140470/using-a-column-value-as-a-parameter-to-a-spark-dataframe-function) – pault May 30 '19 at 14:05
  • @giser_yugang this is pyspark your linked question is about scala. – Idodo Dec 07 '20 at 10:19

1 Answers1

5

You are almost close. Here is detailed example with withColumn and selectExpr options:

Sample df

df = spark.createDataFrame([('This is','This'),
('That is','That'),
('That is','There')],
['text','name'])

#+-------+-----+
#|   text| name|
#+-------+-----+
#|This is| This|
#|That is| That|
#|That is|There|
#+-------+-----+

Option 1: withColumn using expr function

from pyspark.sql.functions import expr, regexp_replace

df.withColumn("new_col1",expr("regexp_replace(text,name,'NAME')")).show()

#+-------+-----+--------+
#|   text| name|new_col1|
#+-------+-----+--------+
#|This is| This| NAME is|
#|That is| That| NAME is|
#|That is|There| That is|
#+-------+-----+--------+

Option 2: selectExpr using regexp_replace

 from pyspark.sql.functions import regexp_replace


df.selectExpr("*",
          "regexp_replace(text,name,'NAME') AS new_text").show()

#+-------+-----+--------+
#|   text| name|new_text|
#+-------+-----+--------+
#|This is| This| NAME is|
#|That is| That| NAME is|
#|That is|There| That is|
#+-------+-----+--------+
Shantanu Sharma
  • 3,661
  • 1
  • 18
  • 39
  • Do you happen to know how to handle the case when name is a regex expression? I'm seeing an issue with expr("regexp_replace(column, 'regex', 'replace_value')") – user43395 Jul 27 '20 at 19:54
  • To add, it's because the 'regex' is regex, but it being surrounded as a string from expr seems to interfere. – user43395 Jul 27 '20 at 20:37
  • I think I solved it, but not sure why. ^([^.]+)?\\. worked instead of ^.*?\\. (but the latter works when I don't use expr) – user43395 Jul 27 '20 at 20:41