
I'd like to perform an atypical regexp_replace in PySpark based on two columns: one attribute holds the address and another holds the city, and I would like to use the city attribute to delete it from the address, when it is present. I have written a function to do this:

df = spark.createDataFrame(
    [
        (1, 'hügelstrasse 34, ansbach', 'ansbach'),
        (2, 'panton st. 2, london', 'london')
    ],
    ('id', 'address', 'city')
)

def dropCityAddress(street, city):
    new = regexp_replace(street, city, '')
    return new

df.withColumn('newaddress', dropCityAddress(col('address'), col('city')))

but this fails with a "Column is not iterable" error, because regexp_replace expects its pattern argument as a string, not a Column. The desired output is a new column with the city removed from the address (I am not interested in commas or other leftovers, just in deleting the city). I will run this on a large dataset, so a solution based on something like a collect action is not suited to this problem.

Is there a way to perform this task?
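For clarity, the per-row operation being asked for is equivalent to this plain-Python sketch, with re.sub standing in for Spark's regexp_replace (the function name here is mine, not part of the question's code):

```python
import re

def drop_city_from_address(address: str, city: str) -> str:
    # Remove every occurrence of the city name from the address string.
    # re.escape guards against regex metacharacters in the city name.
    return re.sub(re.escape(city), '', address)

rows = [
    (1, 'hügelstrasse 34, ansbach', 'ansbach'),
    (2, 'panton st. 2, london', 'london'),
]
for _id, address, city in rows:
    print(_id, drop_city_from_address(address, city))
```

The Spark question is then how to apply this row-wise logic where the pattern itself comes from a second column.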

unter_983
    please provide some sample data and a reproducible scenario for your issue, you can follow the guidelines defined [here](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples) for further details – abiratsis Jul 02 '21 at 09:46

1 Answer


Use expr so the SQL form of regexp_replace is evaluated, which accepts the city column as the pattern argument:

from pyspark.sql.functions import expr

df.withColumn("newaddress", expr("regexp_replace(address, city, '')")).show(truncate=False)
+---+------------------------+-------+-----------------+
|id |address                 |city   |newaddress       |
+---+------------------------+-------+-----------------+
|1  |hügelstrasse 34, ansbach|ansbach|hügelstrasse 34, |
|2  |panton st. 2, london    |london |panton st. 2,    |
+---+------------------------+-------+-----------------+
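One caveat: regexp_replace interprets the city value as a regular expression, so a city name containing metacharacters (a hypothetical "st. louis", say) can over-match, since "." matches any character. A plain-Python sketch of the pitfall, with re.sub standing in for Spark's regexp_replace:

```python
import re

address = '12 stx louis ave, st. louis'
city = 'st. louis'

# Unescaped: the '.' in the pattern matches any character,
# so 'stx louis' is removed as well as 'st. louis'.
unescaped = re.sub(city, '', address)

# Escaped: only the literal city name is removed.
escaped = re.sub(re.escape(city), '', address)

print(unescaped)
print(escaped)
```

If this matters for your data, one possible workaround (untested sketch; Spark uses Java's regex dialect) is to quote the pattern, e.g. wrapping the city in \Q...\E before passing it to regexp_replace.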
Srinivas