
I am using AWS Glue to ingest from a MySQL database. I know that I can use custom queries when ingesting data via pyspark-JDBC. Does the same apply when ingesting based on a crawler? Right now I am using this:

datasource = glueContext.create_dynamic_frame.from_catalog(database="db_name", table_name="table_name")

Is there any way that I can ingest only part of the table instead of the whole thing? Something like `select * from table where column_x > value`.
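For reference, the pyspark-JDBC route I mentioned above looks roughly like this (the endpoint, credentials, and filter value are placeholders; the actual read is commented out since it needs a live MySQL instance):

```python
# Spark's JDBC reader accepts a "query" option (Spark 2.4+), so the
# WHERE clause is pushed down and executed inside MySQL itself.
jdbc_options = {
    "url": "jdbc:mysql://host:3306/db_name",   # placeholder endpoint
    "user": "user",                            # placeholder credentials
    "password": "password",
    "driver": "com.mysql.cj.jdbc.Driver",
    # Only the filtered rows ever leave the database:
    "query": "SELECT * FROM table_name WHERE column_x > 100",
}

# Requires a running database, so shown for illustration only:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```

What I am asking is whether the catalog-based `create_dynamic_frame.from_catalog` call supports an equivalent filter.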

Gerasimos
  • You cannot apply a filter on a JDBC table when loading from the Glue catalog, but you can use a JDBC connection to push the filter down to the database engine, as explained in https://stackoverflow.com/a/54375010/4326922 . – Prabhakar Reddy Sep 30 '20 at 15:42
  • Thanks for the reply. I was hoping I could avoid that and just ingest from glue catalog directly! Thanks though! – Gerasimos Sep 30 '20 at 16:55
  • 1
    you should try job bookmarks then https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html but you cannot use SQL here too. – Prabhakar Reddy Oct 01 '20 at 02:49
  • Thanks again. The problem is with the initial ingestion, due to the large volume of the data (after that, bookmarks will be used). If I use "hashfield": "datetime_field" (I know I still won't be able to specify that only e.g. 2019 data should be ingested), will it help, or does it only work against physical partitions? – Gerasimos Oct 05 '20 at 12:27

0 Answers