
I am using AWS Glue to ingest from a MySQL database. I know that I can use custom queries when ingesting data via pyspark-JDBC. Does the same apply when ingesting based on a crawler? Right now I am using this:

datasource = glueContext.create_dynamic_frame.from_catalog(database="db_name", table_name="table_name")

Is there any way that I can ingest only part of the table instead of the whole thing? Something like `select * from table where column_x > value`.
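For reference, the pyspark-JDBC route I mentioned above looks roughly like this (the endpoint, credentials, and filter value are placeholders; the actual read is commented out since it needs a live MySQL instance):

```python
# Spark's JDBC reader accepts a "query" option (Spark 2.4+), so the
# WHERE clause is pushed down and executed inside MySQL itself.
jdbc_options = {
    "url": "jdbc:mysql://host:3306/db_name",   # placeholder endpoint
    "user": "user",                            # placeholder credentials
    "password": "password",
    "driver": "com.mysql.cj.jdbc.Driver",
    # Only the filtered rows ever leave the database:
    "query": "SELECT * FROM table_name WHERE column_x > 100",
}

# Requires a running database, so shown for illustration only:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```

What I am asking is whether the catalog-based `create_dynamic_frame.from_catalog` call supports an equivalent filter.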

Gerasimos
  • You cannot apply a filter on a JDBC table when loading from the Glue catalog, but you can use a JDBC connection to push the filter down to the database engine, as explained in https://stackoverflow.com/a/54375010/4326922 . – Prabhakar Reddy Sep 30 '20 at 15:42
  • Thanks for the reply. I was hoping I could avoid that and just ingest from glue catalog directly! Thanks though! – Gerasimos Sep 30 '20 at 16:55
  • 1
    you should try job bookmarks then https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html but you cannot use SQL here too. – Prabhakar Reddy Oct 01 '20 at 02:49
  • Thanks again. The problem is with the initial ingestion, due to the large volume of the data (after that, bookmarks will be used). If I use "hashfield": "datetime_field" (I know I still won't be able to specify that only e.g. 2019 data should be ingested), will it help, or does it only work against physical partitions? – Gerasimos Oct 05 '20 at 12:27

0 Answers