
I understand that \p{C} matches "invisible control characters and unused code points", so \P{C} matches every character outside that category: https://www.regular-expressions.info/unicode.html

When I run this in a Databricks notebook, it works fine:

%sql
SELECT regexp_replace('abcd', '\\P{C}', 'x')

But the following fails in both %python and %scala:

%python 
s = "SELECT regexp_replace('abcd', '\\P{C}', 'x')"
display(spark.sql(s))

java.util.regex.PatternSyntaxException: Illegal repetition near index 0
P{C}
^

The SQL command also works fine in Hive. I also tried escaping the curly braces as suggested here, but to no avail.

Is there anything else I am missing? Thanks.

Gadam

1 Answer


Spark SQL API: try adding 4 backslashes to escape 1 \. The Scala/Python string literal consumes one backslash of each pair, and the Spark SQL parser consumes another, so \\\\ in source code reaches the regex engine as a single \:

spark.sql("SELECT regexp_replace('abcd', '\\\\P{C}', 'x')").show()
//+------------------------------+
//|regexp_replace(abcd, \P{C}, x)|
//+------------------------------+
//|                          xxxx|
//+------------------------------+

spark.sql("SELECT string('\\\\')").show()
//+-----------------+
//|CAST(\ AS STRING)|
//+-----------------+
//|                \|
//+-----------------+
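To see the two layers of un-escaping separately, a raw (triple-quoted) Scala string removes the host-language layer, leaving only the SQL parser's \\ -> \ step (a quick sketch, same result as above):

spark.sql("""SELECT regexp_replace('abcd', '\\P{C}', 'x')""").show()
//+------------------------------+
//|regexp_replace(abcd, \P{C}, x)|
//+------------------------------+
//|                          xxxx|
//+------------------------------+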

Alternatively:

enable the spark.sql.parser.escapedStringLiterals property to fall back to the Spark 1.6 string-literal behavior, in which the SQL parser leaves backslashes untouched:

spark.sql("set spark.sql.parser.escapedStringLiterals=true")
spark.sql("SELECT regexp_replace('abcd', '\\P{C}', 'x')").show()
//+------------------------------+
//|regexp_replace(abcd, \P{C}, x)|
//+------------------------------+
//|                          xxxx|
//+------------------------------+
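The same switch can also be flipped through the runtime config instead of a SET statement (a sketch; equivalent to the line above):

spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")
spark.sql("SELECT regexp_replace('abcd', '\\P{C}', 'x')").show()
//returns xxxx, as above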

DataFrame API: add 2 backslashes \\ to escape 1 \. Here only the Scala/Python string literal does any un-escaping; the pattern is passed straight to the regex engine without going through the SQL parser:

import org.apache.spark.sql.functions.{lit, regexp_replace}
import spark.implicits._

val df = Seq("1").toDF("value") // sample DataFrame matching the output below
df.withColumn("dd", regexp_replace(lit("abcd"), "\\P{C}", "x")).show()
//+-----+----+
//|value|  dd|
//+-----+----+
//|    1|xxxx|
//+-----+----+

df.withColumn("dd",lit("\\")).show()
//+-----+---+
//|value| dd|
//+-----+---+
//|    1|  \|
//+-----+---+
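
To see why a single escape level suffices in the DataFrame API, the pattern can be handed straight to java.util.regex, which is what Spark's regexp_replace uses under the hood (a quick sketch in the Scala REPL):

import java.util.regex.Pattern
Pattern.compile("\\P{C}").matcher("abcd").replaceAll("x")
//res: String = xxxx   (every character of "abcd" is outside category C)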
notNull
  • Thank you. Any insight into why they are different between those two APIs? – Gadam Jun 08 '20 at 05:11
  • @Gadam, since Spark 2.0 there have been some changes to string literals; more information can be found here: https://github.com/apache/spark/pull/25001/files/24a796e558d2f22ef2dc1e7bf919d30a7959d6d3#diff-39298b470865a4cbc67398a4ea11e767R86 – notNull Jun 08 '20 at 05:30