
I created an AWS Glue job using Glue Studio. It takes data from a Glue Data Catalog table, does some transformations, and writes to a different Data Catalog table.

When configuring the target node, I enabled the option to create new partitions after running:

(screenshot: target node configuration with the option to update the Data Catalog and create new partitions enabled)

The job runs successfully and the data is written to S3 with the proper partition folder structure, but no new partitions are created in the actual Data Catalog table - I still have to run a Glue Crawler to create them.

The code in the generated script that is responsible for partition creation is this (last two lines of the job):

DataSink0 = glueContext.write_dynamic_frame.from_catalog(
    frame=Transform4,
    database="tick_test",
    table_name="test_obj",
    transformation_ctx="DataSink0",
    additional_options={
        "enableUpdateCatalog": True,
        "updateBehavior": "LOG",
        "partitionKeys": ["date", "provider"],
    },
)
job.commit()

What am I doing wrong? Why are new partitions not being created? How do I avoid having to run a crawler to have the data available in Athena?

I am using Glue 2.0 - PySpark 2.4

gshpychka
  • Just a question: do you have to run the crawler every single time you run the job (with the same schema)? Running the crawler once after a schema change is expected, but it should not be necessary for subsequent runs. – Coockson Mar 22 '21 at 15:30
  • Yes, I do, as the Glue job doesn't create new partitions in the data catalog. – gshpychka Mar 23 '21 at 14:18
  • @gshpychka I am having a similar problem: I have set enableUpdateCatalog=True and updateBehavior=LOG to update my Glue table with one partition key. After the job runs, no new partitions are added to my Glue catalog table, but the data in S3 is separated by the partition key I used. How do I get the job to automatically add partitions to my Glue catalog table? – Vijeth Kashyap Sep 09 '22 at 12:05

1 Answer


As highlighted in the documentation, there are restrictions on adding new partitions to the Data Catalog from an ETL job. More specifically, make sure your use case does not violate any of the following:

Only Amazon Simple Storage Service (Amazon S3) targets are supported.

Only the following formats are supported: json, csv, avro, and parquet.

To create or update tables with the parquet classification, you must use the AWS Glue optimized parquet writer for DynamicFrames (see the sketch after this list).

When the updateBehavior is set to LOG, new partitions will be added only if the DynamicFrame schema is equivalent to or contains a subset of the columns defined in the Data Catalog table's schema.

Your partitionKeys must be equivalent, and in the same order, between your parameter passed in your ETL script and the partitionKeys in your Data Catalog table schema.
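For reference, the AWS Glue documentation illustrates catalog updates with the getSink API and the glueparquet format (the optimized parquet writer mentioned above). Below is a minimal sketch of that pattern, reusing the asker's database, table, and partition keys; the S3 path is a hypothetical placeholder and must point at the table's location:

# Sketch only: reuses the asker's database/table names and partition keys;
# the S3 path is hypothetical. Adjust updateBehavior to your needs.
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://your-bucket/your-prefix/",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["date", "provider"],
)
sink.setCatalogInfo(catalogDatabase="tick_test", catalogTableName="test_obj")
sink.setFormat("glueparquet")  # the Glue optimized parquet writer
sink.writeFrame(Transform4)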

Gaurav Tiwari
  • All of these hold true in my use case. – gshpychka Jun 06 '21 at 09:10
  • I can confirm this solved the problem for me (I was using ORC rather than Parquet). @gshpychka, check whether you are also using the Glue optimized parquet writer and that you have the correct permissions to update the schema. – Luis Miguel Mejía Suárez Jun 06 '21 at 15:26
  • @LuisMiguelMejíaSuárez Can you please elaborate what permissions are needed for updating the schema? – Vijeth Kashyap Sep 12 '22 at 07:29
  • Hi @VijethKashyap, I no longer work on that project and my memory is not the best, but I asked one of my teammates and they replied: "The IAM role that is used to run the Glue job must have the following S3 permissions: `PutObject`, `GetObject`, `ListBucket` & `DeleteObject` over the data lake's bucket. Additionally, we also associated the following policies: `AWSGlueServiceRole` & `AWSLakeFormationDataAdmin` (because they were governed tables)"; hope it helps. – Luis Miguel Mejía Suárez Sep 12 '22 at 17:17
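To illustrate the S3 permissions mentioned in the comment above, here is a minimal sketch of such a policy statement expressed as a Python dict (the bucket name is hypothetical; the managed policies AWSGlueServiceRole and AWSLakeFormationDataAdmin would be attached to the job role separately):

# Sketch of the S3 permissions described in the comment above,
# as an IAM policy statement in Python dict form. Bucket name is hypothetical.
s3_statement = {
    "Effect": "Allow",
    "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket", "s3:DeleteObject"],
    "Resource": [
        "arn:aws:s3:::your-datalake-bucket",    # bucket-level, needed for ListBucket
        "arn:aws:s3:::your-datalake-bucket/*",  # object-level actions
    ],
}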