0

I'm trying to read data from ScyllaDB and want to remove \n and \r character from a column. The problem is that these characters are stored as string in the column of a table being read and I need to use REGEX_REPLACE as I'm using Spark SQL for this. The regex pattern don't seem to work which work in MySQL. The string becomes blank but doesn't remove the characters. Below is the snippet of the query being used in Spark SQL. Help appreciated.

The following string is present in the message column: 'hello\nworld\r'

The expected output is 'hello world'

df=spark.sql("select  REGEXP_REPLACE(message,'\n|\r|\r\n',' ') as replaced_message from delivery_sms")
ash
  • 1
  • 4
  • 1
    If you are literally trying to replace the string `\n` or `\r`, you need to escape the slash - `regexp_replace(message, '\\n|\\r, ' ')` – Andrew Jul 28 '22 at 14:05
  • @Andrew , It is not working in case of pyspark sql but it works in mysql query. Can you please suggest any other way (another function) which I can apply that will be helpful – ash Jul 29 '22 at 10:40
  • Also I have referred link https://stackoverflow.com/questions/56371701/pyspark-replace-string-from-column-based-on-pattern-from-another-column I tried with dataframe which I have read from scyllaDB but it is not working with that dataframe . But when I tried same example as it is given in link it works. If you can please let me know what would be the reason – ash Jul 29 '22 at 11:32
  • Hrm, that's odd. I can't make it work in spark sql either. You can do it using regexp_replace and withColumn on the data frame though. You have to use 4 slashes for each - `df.withColumn("test",regexp_replace("_c0","\\\\n|\\\\r"," ")).show()`. – Andrew Jul 29 '22 at 18:19

2 Answers2

0

Thanks Andrew's for the answer.

The following worked for me:

df.withColumn("test",regexp_replace("_c0","\\\\n|\\\\r"," ")).show()
ash
  • 1
  • 4
  • Welcome to SO. The answer is very brief and can be improved by providing more details. It will help others understand. Please read https://stackoverflow.com/help/how-to-answer. – Azhar Khan Aug 08 '22 at 04:30
0

If anybody is looking for a modern solution to this problem:

df.select(
    F.translate(F.col("test"), "\n\t", " " * len("\n\t")).alias("test")
)
ciurlaro
  • 742
  • 10
  • 22