
I'm trying to replace part of the text in a column with "?", but I get an error. I'd like to do this with the native Spark DataFrame API (no UDFs).

TypeError: Column is not iterable

from pyspark.sql.functions import regexp_replace, substring, col

data = [("John Doe", "123 Main St, Anytown, USA"),
        ("Jane Smith", "456 High St, Somewhere, USA")]

df = spark.createDataFrame(data, ["name", "address"])

start = 3
end = 10  # note: the 3rd argument of substring is a length, not an end index
# Fails with the TypeError above: the Python regexp_replace expects a
# string pattern, not a Column, so substring(...) can't be passed here.
df_masked = df.withColumn("address_masked", regexp_replace(col("address"), substring("address", start, end), "??????????"))

Expected result:

name        address_masked
John Doe    123????????, Anytown, USA
Jane Smith  456??????????, Somewhere, USA
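For reference, Spark SQL's `substring(str, pos, len)` takes a 1-based starting position and a *length*, not an end index. A plain-Python sketch (a hypothetical helper, not Spark code) of which characters `substring(address, 3, 10)` covers, and what masking that span yields:

```python
def mask_span(address: str, start: int, length: int) -> str:
    """Replace `length` characters of `address`, beginning at 1-based
    position `start`, with '?' - mirroring SQL substring semantics."""
    i = start - 1  # SQL substring is 1-based, Python slicing is 0-based
    return address[:i] + "?" * length + address[i + length:]

print(mask_span("123 Main St, Anytown, USA", 3, 10))
# 12?????????? Anytown, USA
```

So with `start = 3` and `end = 10`, the masked region starts at the third character and runs for ten characters, which is why the result differs from the expected output shown above.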
  • Possible duplicate of [Pyspark replace string from column based on pattern from another column](https://stackoverflow.com/questions/56371701/pyspark-replace-string-from-column-based-on-pattern-from-another-column) – Jay Shankar Gupta Apr 18 '23 at 07:06

1 Answer


Use `expr()` so the whole expression is evaluated as Spark SQL, where `regexp_replace` accepts a column-valued pattern. The Python-side `regexp_replace` only accepts a plain string as the pattern, which is why passing `substring(...)` (a Column) raises `TypeError: Column is not iterable`.

Example:

from pyspark.sql.functions import expr

df_masked = df.withColumn(
    "address_masked",
    expr(f'regexp_replace(address, substring(address, {start}, {end}), "??????????")')
)
df_masked.show(truncate=False)
#name       address                     address_masked
#John Doe   123 Main St, Anytown, USA   12?????????? Anytown, USA
#Jane Smith 456 High St, Somewhere, USA 45?????????? Somewhere, USA
– notNull