In PySpark, when using a Spark SQL function such as `regexp_extract` or `regexp_replace`, raw strings (string literals prefixed with `r`) are supported when running locally, but not when running on EMR.
A simple example to reproduce the issue is:
```python
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("test") \
    .getOrCreate()

# Raw strings at both the Python level (r"...") and the SQL level (r'...')
spark.sql(r"select regexp_replace(r'ABC\123XYZ\456', r'[\\][\d][\d][\d]', '') as new_value").show()
```
This runs successfully on PySpark 3.3.0 locally, but when executed on EMR it raises the parsing exception:

```
pyspark.sql.utils.ParseException:
Literals of type 'R' are currently not supported
```

Looking at the session configuration options for Spark SQL, there does not appear to be any option that changes how raw strings are parsed; the closest is `spark.sql.parser.quotedRegexColumnNames`.
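For completeness, the fallback I would otherwise use is to drop the raw literals and double every backslash for Spark SQL's string-literal unescaping. This is only a sketch of the equivalent query, assuming EMR applies Spark's default escape handling; it sidesteps the error but does not explain the discrepancy:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()

# Equivalent query without SQL raw literals. The Python string is still raw,
# so only Spark SQL unescapes the backslashes:
#   'ABC\\123XYZ\\456'      -> value   ABC\123XYZ\456
#   '[\\\\][\\d][\\d][\\d]' -> pattern [\\][\d][\d][\d]
spark.sql(
    r"select regexp_replace('ABC\\123XYZ\\456', '[\\\\][\\d][\\d][\\d]', '') as new_value"
).show()
```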
Anecdotally, a colleague mentioned to me a few years ago that AWS maintains an internal, customized build of Spark for EMR, but I have not found any documentation to corroborate that. Even if that were the case, I would expect support for a feature like this to be maintained.
There could also be a Spark configuration option that I missed during my investigation.
For anyone with deeper insight, or who recognizes this issue: why does the discrepancy exist?
Thank you in advance!
Related posts:
- SparkSQL Regular Expression: Cannot remove backslash from text (a developer tried a recommended raw-string solution to a regex problem but failed with the same parsing error on EMR; the example snippet above is based on this question)
- regexp extract pyspark sql: ParseException Literals of type 'R' are currently not supported ("If I try this code with Pyspark locally it works (Pyspark version 3.3.0), but when I run this code in an EMR job it fails, I'm using emr-6.6.0 as application")