PySpark with SynapseSQL throws error when using queries with restricted keywords in column name

Question

I'm querying a SQL database on an Azure Synapse Analytics server with PySpark. However, some of my column names contain restricted keywords. Below is a generic example in which the restricted word is "close" (it appears inside the alias ClosedByName). In this simple case I could work around the problem by giving c.Name a different alias, but that is not possible in my real query, because I need to JOIN two tables whose column names contain restricted words. Here is the generic example:

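# Imports required by the Synapse SQL connector
import com.microsoft.spark.sqlanalytics
from com.microsoft.spark.sqlanalytics.Constants import Constants

query = """Select c.Name as ClosedByName, c.Profile_Name__c FROM dbo.table as c"""

# Read from a query
dfToReadFromQueryAsArgument = (
    spark.read
    .option(Constants.DATABASE, "server")
    .option(Constants.SERVER, "workspace.sql.azuresynapse.net")
    .synapsesql(query)
)
dfToReadFromQueryAsArgument.show()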

Interestingly, the script works with either of the following queries:

query = """Select c.Name, c.Profile_Name__c FROM dbo.table as c"""
query = """Select c.Name as losedByName, c.Profile_Name__c FROM dbo.table as c"""

I've tried using backticks, as suggested in different posts (closed), but this didn't work. I've also tried other escape and quoting characters, and none of them worked.

So I need a way to avoid this check, either by quoting the text or by forcing the query to run even though it contains a restricted word ("close" is not the same as "closed"). I think the check is too aggressive in my case.

Error message:

Py4JJavaError                             Traceback (most recent call last)
/tmp/ipykernel_6968/4240601845.py in <module>
      1 query = """Select c.Name as ClosedByName,       c.Profile_Name__c FROM dbo.table as c"""
      2 
----> 3 dfToReadFromQueryAsOption = (spark.read
      4                      # Name of the SQL Dedicated Pool or database where to run the query
      5                      # Database can be specified as a Spark Config - spark.sqlanalyticsconnector.dw.database or as a Constant - Constants.DATABASE

~/cluster-env/env/lib/python3.8/site-packages/com/microsoft/spark/sqlanalytics/SqlAnalyticsReader.py in synapsesql(self, table_name)
     40         df = DataFrame(jdf, sqlcontext)
     41     except Exception as e:
---> 42         raise e
     43     return df

~/cluster-env/env/lib/python3.8/site-packages/com/microsoft/spark/sqlanalytics/SqlAnalyticsReader.py in synapsesql(self, table_name)
     37         connector = sqlcontext._jvm.com.microsoft.spark.sqlanalytics.SqlAnalyticsConnectorClass() \
     38             .SQLAnalyticsFormatReader(self._jreader)
---> 39         jdf = connector.synapsesql(table_name)
     40         df = DataFrame(jdf, sqlcontext)
     41     except Exception as e:

~/cluster-env/env/lib/python3.8/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

~/cluster-env/env/lib/python3.8/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o3616.synapsesql.
: com.microsoft.spark.sqlanalytics.SQLAnalyticsConnectorException: Queries with keywords: 
create
alter
drop
with
exec
execute
insert
delete
disable
enable
update
merge
truncate
backup
restore
collate
close
deny
grant
open
revoke
revert are not allowed
    at com.microsoft.spark.sqlanalytics.utils.SQLAnalyticsConnectorOptionsValidator$.validateOptions(SQLAnalyticsConnectorOptionsValidator.scala:118)
    at com.microsoft.spark.sqlanalytics.utils.SQLAnalyticsConnectorOptionsValidator$.validateOptions(SQLAnalyticsConnectorOptionsValidator.scala:68)
    at com.microsoft.spark.sqlanalytics.utils.Utils$.initializeAndValidateOptions(Utils.scala:122)
    at com.microsoft.spark.sqlanalytics.ItemsTable.readSchema(ItemsTable.scala:96)
    at com.microsoft.spark.sqlanalytics.ItemsTable.$anonfun$schema$1(ItemsTable.scala:88)
    at scala.Option.getOrElse(Option.scala:189)
    at com.microsoft.spark.sqlanalytics.ItemsTable.schema(ItemsTable.scala:88)
    at com.microsoft.spark.sqlanalytics.SynapseSqlDataSourceV2.inferSchema(SynapseSqlDataSourceV2.scala:46)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:81)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:303)
    at scala.Option.map(Option.scala:230)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:273)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at com.microsoft.spark.sqlanalytics.SqlAnalyticsConnectorClass$SQLAnalyticsFormatReader.sqlanalytics(SqlAnalyticsConnectorClass.scala:105)
    at com.microsoft.spark.sqlanalytics.SqlAnalyticsConnectorClass$SQLAnalyticsFormatReader.synapsesql(SqlAnalyticsConnectorClass.scala:82)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)
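
The behaviour reported above (ClosedByName fails, losedByName works) is consistent with a case-insensitive substring match against the keyword list in the exception. The connector's validator source is not shown here, so the sketch below is only a guess at that logic, and `find_restricted_keywords` is a hypothetical helper name. It can still be useful as a pre-flight check that shows which keyword a query would trip over:

```python
# Keyword list copied from the connector's error message above.
RESTRICTED_KEYWORDS = [
    "create", "alter", "drop", "with", "exec", "execute", "insert",
    "delete", "disable", "enable", "update", "merge", "truncate",
    "backup", "restore", "collate", "close", "deny", "grant", "open",
    "revoke", "revert",
]


def find_restricted_keywords(query):
    """Return the restricted keywords found anywhere in the query text.

    ASSUMPTION: the connector matches keywords as case-insensitive
    substrings. This is inferred from the observed behaviour only;
    it is not taken from the connector's code.
    """
    lowered = query.lower()
    return [kw for kw in RESTRICTED_KEYWORDS if kw in lowered]


failing = "Select c.Name as ClosedByName, c.Profile_Name__c FROM dbo.table as c"
working = "Select c.Name as losedByName, c.Profile_Name__c FROM dbo.table as c"

print(find_restricted_keywords(failing))  # ['close']
print(find_restricted_keywords(working))  # []
```

If the assumption holds, quoting or escaping the identifier cannot help, because the match is made on the raw query text before the SQL is parsed.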

@nbk - I've tried this [ ]. This works in normal SQL but it's not working with PySpark on Azure – Lukasz, Aug 03 '23 at 14:53
@nbk - I disagree. I literally have scripts where the "c" is missing from the word "closed", and with "losed" it works. IMHO this is a bug. I am trying to find a solution for it – Lukasz, Aug 04 '23 at 07:47
@nbk - I've updated my post to include error messages and examples of code which work and don't. Maybe I need to change the version of Python or PySpark? I am surprised that I couldn't find other people having this issue... – Lukasz, Aug 04 '23 at 08:12
This looks more like a bug in Java, but I can't think why that would be. Why not simply change it to TerminatedByName and go on with your life? – nbk, Aug 04 '23 at 08:40
I have also noticed strange quirks like this between a Spark pool in Synapse and something like Spark in Databricks. Probably the faulty side is Synapse. – Ziya Mert Karakas, Aug 07 '23 at 23:30

Subash · Answer 1 · 2023-08-18T06:03:57.720

-1

I created a table with a column named SELECT and then queried it using an f-string:

# Add required imports
import com.microsoft.spark.sqlanalytics
from com.microsoft.spark.sqlanalytics.Constants import Constants
from pyspark.sql.functions import col

column_name = "SELECT"

dfToReadFromQueryAsOption = (
    spark.read
    .option(Constants.SERVER, "<server>.sql.azuresynapse.net")
    .option(Constants.TEMP_FOLDER, "abfss://<container>@<storage>.dfs.core.windows.net/sql")
    .synapsesql("test.dbo.demotest")
    .select(f"{column_name}")
)

# Show contents of the dataframe
display(dfToReadFromQueryAsOption)
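
The same idea can be sketched more generally: read each table by name instead of passing a query, and do the aliasing on the Spark side, so that an alias such as ClosedByName never appears in text sent to the connector. This is an unverified sketch. Whether reading by table name avoids the keyword check is an assumption, and `read_table_and_rename` is a hypothetical helper, not part of the connector.

```python
def read_table_and_rename(reader, table_name, renames):
    """Load a whole table by name, then rename columns in Spark.

    reader:     a DataFrameReader already configured with the connector
                options (for example spark.read.option(Constants.SERVER, ...)).
    table_name: three-part name such as "test.dbo.demotest".
    renames:    mapping of source column name -> desired column name.

    ASSUMPTION: the keyword check applies to query text only, not to
    a plain table name. This has not been verified.
    """
    df = reader.synapsesql(table_name)
    for old_name, new_name in renames.items():
        df = df.withColumnRenamed(old_name, new_name)
    return df


# Usage in a Synapse notebook (placeholders left as-is):
# df = read_table_and_rename(
#     spark.read.option(Constants.SERVER, "<server>.sql.azuresynapse.net"),
#     "test.dbo.demotest",
#     {"Name": "ClosedByName"},
# )
```

Two DataFrames loaded this way could then be joined with `DataFrame.join` in Spark rather than in the SQL text.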

edited Aug 18 '23 at 06:03

answered Aug 18 '23 at 04:22


SELECT is not a restricted word. Here is the list of restricted words: create alter drop with exec execute insert delete disable enable update merge truncate backup restore collate close deny grant open revoke revert – Lukasz Aug 29 '23 at 12:49
So my script works with either function, dfToReadFromQueryAsArgument or dfToReadFromQueryAsOption. However, it fails if the query contains the word "create", for example in a column named "CreatedBy". – Lukasz Aug 29 '23 at 12:51
Did you try it with an f-string? – Subash Aug 30 '23 at 12:53
