Spark.sql run query without any optimizations (aws glue athena)

Question

i need to perform from a Glue Job (spark.sql) an SQL request to the Athena on AWS.

My query is very simple

df = spark.sql("select * from hashes 
               where year='2109' and month='10' and day='08' 
               and myhashes in (%s) order by timestamp desc" % ( 
               ",".join( "'"+str(x)+"'" for x in myhashes ))  )

This code produce a string like

select * from hashes where year='2019' 
     and month='10' and day='08' 
     and myhashes in (
    '06SN931', 
    '06SN931', 
    '06SP317', 
    ...........
    '86X0297'
    )

And it works in Athena very well

But if I run the Glue Job spark seems to convert the query from IN to OR syntax like

where .... day ='08' and (myhashes = '06XH8V3' or myhashes = '06X68P4' or my.....) and produce an error.

Here the exception
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:759)
    ... 64 more


Caused by: MetaException(message:1 validation error detected: Value 'year = '2019' and month = '10' and day = '08' and (myhashes = '06XH58V3' or myhashes = '06X658P4' or myhashes = '45X42051' or myhashes = '15S03560' or myhashes = '10S2868' or myhashes = '416S2661' or myhashes = 'dDSD' or myhashes = 'DSSD' or myhashes = '13XE639' or myhashes = '06X668N7' or myhashes = '06X364T2' or 
.......
myhashes = '96S652207' or myhashes = '06X26365M' or myhashes = '10X560c89' or myhashes = '06X01N8' or )' 


at 'expression' failed to satisfy constraint: Member must have length less than or equal to 2048 (Service: AWSGlue; Status Code: 400; Error Code: ValidationException; Request ID: 83f7bc7b-0d10-11ea-9a8c-fdfadfa2a22b))
            at com.amazonaws.glue.catalog.converters.CatalogToHiveConverter.getHiveException(CatalogToHiveConverter.java:100)
            at com.amazonaws.glue.catalog.converters.CatalogToHiveConverter.wrapInHiveException(CatalogToHiveConverter.java:88)
            at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.getCatalogPartitions(GlueMetastoreClientDelegate.java:948)
            at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.getPartitions(GlueMetastoreClientDelegate.java:911)
            at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.listPartitionsByFilter(AWSCatalogMetastoreClient.java:1179)
            at org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Hive.java:2255)
            ... 69 more

        End of LogType:stdout

Is there a way to disable spark internal optimizations for SQL?

Possible duplicate of [In Apache Spark 2.0.0, is it possible to fetch a query from an external database (rather than grab the whole table)?](https://stackoverflow.com/questions/38729436/in-apache-spark-2-0-0-is-it-possible-to-fetch-a-query-from-an-external-database) — 10465355, Nov 22 '19 at 11:35

score 0 · Answer 1 · answered Nov 22 '19 at 23:23

0

The error message is hinting that your query is simply too long (more than 2048 characters). AWS Athena and AWS Glue have different constraints.

If possible, try to filter the table (“hashes”) by joining it with a table that contains the values of myhashes, rather than using the SQL in, whenever the number of elements to compare to gets large.

answered Nov 22 '19 at 23:23

Oliver W.

13,169
3
37
50

It's a workaround, not a solution. It's there a way to deactivate this constraints? – Vlad Nov 23 '19 at 19:21
@Vlad I highly doubt it. You're asking if you can request AWS to change the limits of one of their APIs. While you can typically request increases in consumed resources and throughput, I am not aware of changing API call restrictions like this one. Am not an AWS employee though, so perhaps you're better off asking them directly? Note that this issue has nothing to do with Spark internal optimizations, like you wrote, as Spark will happily allow you to make such query plans, even if there are better approaches. – Oliver W. Nov 27 '19 at 00:19

Spark.sql run query without any optimizations (aws glue athena)

1 Answers1