In PySpark, when using a Spark SQL function such as `regexp_extract` or `regexp_replace`, raw strings (string literals prefixed with `r`) are supported when running locally, but not when running on EMR.
A simple example to reproduce the issue is:
```python
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("test") \
    .getOrCreate()

# Raw strings at both the Python level (r"...") and the SQL level (r'...')
spark.sql(r"select regexp_replace(r'ABC\123XYZ\456', r'[\\][\d][\d][\d]', '') as new_value").show()
```
This runs successfully on PySpark 3.3.0 locally, but when executed on EMR it raises the parsing exception:

```
pyspark.sql.utils.ParseException:
Literals of type 'R' are currently not supported
```

Looking at the session configuration options for Spark SQL, there does not appear to be any option that changes how raw strings are parsed; the closest is `spark.sql.parser.quotedRegexColumnNames`.
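For completeness, the fallback I would otherwise use is to drop the raw literals and double every backslash for Spark SQL's string-literal unescaping. This is only a sketch of the equivalent query, assuming EMR applies Spark's default escape handling; it sidesteps the error but does not explain the discrepancy:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()

# Equivalent query without SQL raw literals. The Python string is still raw,
# so only Spark SQL unescapes the backslashes:
#   'ABC\\123XYZ\\456'      -> value   ABC\123XYZ\456
#   '[\\\\][\\d][\\d][\\d]' -> pattern [\\][\d][\d][\d]
spark.sql(
    r"select regexp_replace('ABC\\123XYZ\\456', '[\\\\][\\d][\\d][\\d]', '') as new_value"
).show()
```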
Anecdotally, a colleague mentioned to me a few years ago that AWS maintains an internal, customized build of Spark for EMR, but I have not found any documentation to corroborate that. Even if that were the case, I would expect support for a feature like this to be maintained.
There could also be a Spark configuration option that I missed during my investigation.
For anyone with deeper insight, or who recognizes this issue: why does the discrepancy exist?
Thank you in advance!
Related posts:
- SparkSQL Regular Expression: Cannot remove backslash from text (a developer tried a recommended raw-string solution to a regex problem but failed with the same parsing error on EMR; the example snippet above is based on this question)
- regexp extract pyspark sql: ParseException Literals of type 'R' are currently not supported ("If I try this code with Pyspark locally it works (Pyspark version 3.3.0), but when I run this code in an EMR job it fails, I'm using emr-6.6.0 as application")