
Looking at PySpark, I see that translate and regexp_replace can help me replace a single character that exists in a DataFrame column.

I was wondering whether there is a way to supply multiple strings to regexp_replace or translate so that it would parse them and replace them with something else.

Use case: remove all $, #, and comma (,) characters in a column A.
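For illustration, the intended transformation can be sketched in plain Python (no Spark) with the re module; the sample values here are made up:

```python
import re

# Hypothetical sample values standing in for column A
values = ["$1,000", "#item", "a, b, $5"]

# Remove every $, # and , in one pass using a regex character class
cleaned = [re.sub(r"[$#,]", "", v) for v in values]
print(cleaned)  # ['1000', 'item', 'a b 5']
```

The answers below show how to do the same thing at scale on a Spark column.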

Nikunj Kakadiya

2 Answers


You can use pyspark.sql.functions.translate() to make multiple replacements. Pass in a string of letters to replace and another string of equal length which represents the replacement values.

For example, let's say you had the following DataFrame:

import pyspark.sql.functions as f
df = spark.createDataFrame([("$100,00",), ("#foobar",), ("foo, bar, #, and $",)], ["A"])
df.show()
#+------------------+
#|                 A|
#+------------------+
#|           $100,00|
#|           #foobar|
#|foo, bar, #, and $|
#+------------------+

and wanted to replace ('$', '#', ',') with ('X', 'Y', 'Z'). Simply use translate like:

df.select("A", f.translate(f.col("A"), "$#,", "XYZ").alias("replaced")).show()
#+------------------+------------------+
#|                 A|          replaced|
#+------------------+------------------+
#|           $100,00|           X100Z00|
#|           #foobar|           Yfoobar|
#|foo, bar, #, and $|fooZ barZ YZ and X|
#+------------------+------------------+

If instead you wanted to remove all instances of ('$', '#', ','), you could do this with pyspark.sql.functions.regexp_replace().

df.select("A", f.regexp_replace(f.col("A"), r"[\$#,]", "").alias("replaced")).show()
#+------------------+-------------+
#|                 A|     replaced|
#+------------------+-------------+
#|           $100,00|        10000|
#|           #foobar|       foobar|
#|foo, bar, #, and $|foo bar  and |
#+------------------+-------------+

The pattern "[\$#,]" is a character class that matches any one of the characters inside the brackets. The $ is escaped because it has a special meaning in regex; inside a character class it is actually treated literally, so the escape is optional but harmless.
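A quick local check with Python's re module (outside Spark, which uses Java regex with the same character-class rules) confirms that the escape inside the brackets is optional:

```python
import re

s = "foo, bar, #, and $"
# Inside a character class, $ is literal, so escaping it is optional;
# both patterns remove the same characters.
a = re.sub(r"[\$#,]", "", s)
b = re.sub(r"[$#,]", "", s)
print(a == b)  # True
print(repr(a))  # 'foo bar  and '
```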

pault
  • For removing all instances, you can also use `translate`. Switching to `regex_replace` is not needed. I believe the following would do the job `df.select("A", f.translate(f.col("A"), "$#,", "").alias("replaced")).show()` – Sheldore Oct 31 '21 at 00:04
  • @Sheldore, your solution does not work properly. It replaces characters with space – MuscleUpUp Sep 27 '22 at 12:13
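As a local analogue of the behavior discussed in these comments, Python's built-in str.maketrans/str.translate can delete characters outright by listing them in the third argument; Spark's translate similarly drops characters that have no counterpart in the replacement string (it does not substitute spaces):

```python
# Plain-Python analogue (not pyspark): characters listed in the third
# argument of str.maketrans are mapped to None, i.e. deleted.
table = str.maketrans("", "", "$#,")
print("$100,00".translate(table))  # 10000
```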

If you need to do this in Scala, you can do it with the code below:

import org.apache.spark.sql.functions._
// spark.implicits._ provides toDF and the $"col" syntax
// (already imported in notebook environments such as Databricks)
import spark.implicits._

val df = Seq(("Test$", 19), ("$#,", 23), ("Y#a", 20), ("ZZZ,,", 21)).toDF("Name", "age")
val df1 = df.withColumn("NewName", translate($"Name", "$#,", "xyz"))
display(df1)

You can see the output as below:

+-----+---+-------+
| Name|age|NewName|
+-----+---+-------+
|Test$| 19|  Testx|
|  $#,| 23|    xyz|
|  Y#a| 20|    Yya|
|ZZZ,,| 21|  ZZZzz|
+-----+---+-------+

Nikunj Kakadiya