
I'm using PySpark SQL with regexp_extract like this:

df = spark.createDataFrame([['id_20_30', 10], ['id_40_50', 30]], ['id', 'age'])
df.createOrReplaceTempView("table")
sql_statement="""
select regexp_extract(id, r'(\d+)', 1) as id
from table
"""
df = spark.sql(sql_statement)

But I obtain this error:

Traceback (most recent call last):
  File "/tmp/spark-5b33a47c-6490-4f80-9a97-3acb37f683ec/script.py", line 86, in <module>
    df = spark.sql(sql_statement)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 723, in sql
  File "/usr/lib/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1310, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.ParseException: 
Literals of type 'R' are currently not supported.

The result should be

+---+
| id|
+---+
| 20|
| 40|
+---+

If I remove the r it doesn't fail, but it produces the wrong result. The same code works when I run PySpark locally (version 3.3.0), but it fails in an EMR job (application emr-6.6.0). Do you have any suggestions? I know I could avoid PySpark SQL, but I would prefer to keep using it.

solopiu

1 Answer


You need to escape the '\' for the SQL parser with another '\', and then escape each of those backslashes again in the Python string, so you end up needing four of them (or sidestep the issue with a character class like [0-9]).

sql_statement = "select regexp_extract(id, '([0-9]+)', 1) as id from table"
# or
sql_statement = "select regexp_extract(id, '(\\\\d+)', 1) as id from table"
df = spark.sql(sql_statement)
df.show()
+---+
| id|
+---+
| 20|
| 40|
+---+
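To see why four backslashes are needed, the two escaping layers can be traced in plain Python (no Spark required). The SQL-parser step is simulated below with a simple replace, which is only an approximation of Spark SQL's string unescaping:

```python
import re

# Layer 1: the Python string literal '(\\\\d+)' contains one real
# backslash per escaped pair, so the SQL parser receives (\\d+).
python_literal = '(\\\\d+)'
assert python_literal == r'(\\d+)'  # two real backslashes

# Layer 2: Spark SQL's string parser unescapes \\ to \ , so the regex
# engine finally sees (\d+). We simulate that step with replace().
sql_unescaped = python_literal.replace('\\\\', '\\')
assert sql_unescaped == r'(\d+)'

# The regex engine then matches the digits as intended.
m = re.search(sql_unescaped, 'id_20_30')
print(m.group(1))  # -> 20
```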
Matt Andruff
  • It worked! Thanks a lot! Can you add an explanation? Why do we need 3 extra '\' for every '\' in the pattern? What does that mean? – solopiu Jul 15 '22 at 12:48
  • 2
    "strings" escape '\\' as '\' and you need need to escape '\' in what you pass to SQL. so '\\d' in SQL needs to be written '\\d' and in a string that means you need '\\\\d' – Matt Andruff Jul 15 '22 at 13:36