I have data imbedded in my text fields that I need to suppress. The data is in the format of \nnn where nnn is a 3 digit number. I tried the following:
spark.sql("select regexp_replace('ABC\123XYZ\456','[\d][\d][\d]','') as new_value").show()
I expected the result to be "ABC\XYZ\", but what I got instead was:
+---------+
|new_value|
+---------+
| ABCSXYZĮ|
+---------+
I'm not sure what the other characters are after the C and after Z.
However I need to remove the backslash as well. To get rid of the backslash, I then tried this:
spark.sql("select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value").show()
However it just simply crashed on me. No matter how I try to escape the backslash, it failed. I tested the regular expression on regex101.com and it was fine.
I appreciate any suggestions!
EDIT: I tried Daniel's suggestion but received the following errors:
>>> spark.sql(r"select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value").show()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/session.py", line 649, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.ParseException:
Literals of type 'R' are currently not supported.(line 1, pos 22)
== SQL ==
select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value
----------------------^^^
Thoughts?