Regex to replace multiple occurrence of a string in spark dataframe column using scala

Question

I have a column where a particular string appears multiple times. The number of occurrence is not fixed. I can get such string any number of times.

Example: Column description has the following data

The account account has been cancelled for the account account account and with the account

Here basically I want to replace multiple concurrent occurrence of account with single one

Expected Output:

The account has been cancelled for the account and with the account

score 1 · Accepted Answer · answered Apr 08 '21 at 10:24

You can use the regex pattern (source: java regular expression to remove duplicated words) with regexp_replace to replace duplicate words:

val df = spark.sql("select 'The account account has been cancelled for the account account account and with the account' col")

df.show(false)
+-------------------------------------------------------------------------------------------+
|col                                                                                        |
+-------------------------------------------------------------------------------------------+
|The account account has been cancelled for the account account account and with the account|
+-------------------------------------------------------------------------------------------+

val df2 = df.withColumn("col", regexp_replace(col("col"), "\\b(\\w+)(\\b\\W+\\b\\1\\b)*", "$1"))

df2.show(false)
+-------------------------------------------------------------------+
|col                                                                |
+-------------------------------------------------------------------+
|The account has been cancelled for the account and with the account|
+-------------------------------------------------------------------+

Regex to replace multiple occurrence of a string in spark dataframe column using scala

1 Answers1