
I understand that I can load an entire table from a cataloged JDBC connection via the Glue context, like so:

glueContext.create_dynamic_frame.from_catalog(
    database="jdbc_rds_postgresql",
    table_name="public_foo_table",
    transformation_ctx="datasource0"
)

However, what I'd like to do is partially load a table using the cataloged connection, as if I were using an uncataloged JDBC connection via the Spark session, like this:

query = "(select * from public.foo_table where date='%s') as data" % date_fm
spark.read.jdbc(url=DB_URL, table=query)

Is there a way I could use the cataloged connection instead?

Alternatively, when using an uncataloged connection, I'm having a hard time understanding how to lock down access so that only a Glue job can use it. What are my options?


2 Answers


In order to perform predicate pushdown to databases, you need to use the sampleQuery parameter, as described in this AWS Doc:

"sampleQuery": (Optional) The custom SQL query statement for sampling. By default the sample query is executed by single executor. If you're reading a large dataset, you may need to enable JDBC partitioning to query a table in parallel. For more information, see Reading from JDBC Tables in Parallel. To use sampleQuery with JDBC partitioning, also set enablePartitioningForSampleQuery to true.

Pro tip: you need to set enablePartitioningForSampleQuery to True. Also, end your SQL statement with "where" or "and", as described in the doc:

"enablePartitioningForSampleQuery": (Optional) By default this option is false. Required if you want to use sampleQuery with a partitioned JDBC table. If set to true, sampleQuery must end with "where" or "and" for AWS Glue to append partitioning conditions. See the example below.

Here is my code snippet, where I am bringing in data for sporting_event_id = 958 from an MSSQL database.

I have also added parallelism to my JDBC connection by passing the hashpartitions parameter:

S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
    database="mssql_dms_sample",
    table_name="mssql_dms_sample_dbo_sporting_event_ticket",
    transformation_ctx="S3bucket_node1",
    additional_options={
        "hashpartitions": "20",  # number of parallel JDBC partitions
        "hashfield": "id",  # column to hash-partition on
        "hashexpression": "sporting_event_id",  # expression used to split the query
        "enablePartitioningForSampleQuery": True,
        "sampleQuery": "select * from sporting_event_ticket where sporting_event_id = 958 and",  # must end with "where" or "and"
    },
)

In theory, you can achieve this with a cataloged connection in Glue using pushdown predicates, where you add a push_down_predicate parameter to filter your data as it is read. So, using your example:

glueContext.create_dynamic_frame.from_catalog(
    database="jdbc_rds_postgresql",
    table_name="public_foo_table",
    transformation_ctx="datasource0",
    push_down_predicate="date='%s'" % date_fm
)

However, it seems that at the time of writing this is only supported for S3 sources.
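
For comparison, here is what the documented S3 usage of push_down_predicate looks like (a sketch only; the database and table names are hypothetical, and date is assumed to be a partition column of the table):

glueContext.create_dynamic_frame.from_catalog(
    database="s3_data_lake",  # hypothetical S3-backed catalog database
    table_name="foo_table",  # hypothetical table partitioned by 'date'
    transformation_ctx="datasource0",
    push_down_predicate="date='%s'" % date_fm
)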

  • Currently the pushdown predicate feature is supported for `s3` storage only. However, as far as I know the AWS Glue team is working on implementing it for JDBC connections. – Yuriy Bondaruk Sep 25 '18 at 09:04
  • @YuriyBondaruk do you have any reference to the current s3 limitation? – Rodrigue Jan 29 '19 at 09:24
  • You can find it here: https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/. It doesn't clearly state that only S3 is supported, but I think it's the only public resource where AWS explains the pushdown predicate feature in detail – Yuriy Bondaruk Jan 29 '19 at 15:34