I have a workflow that writes incremental Parquet files daily as events are generated in our system.
Every time it runs, it adds a new Parquet file to an S3 partition for that day, like so:
S3
|-> date=1
|   |-> XXX.parquet // {name: "alice", app_info.tag_1: "foo", app_info.tag_2: "bar"}
|
|-> date=2
    |-> YYY.parquet // {name: "alice", app_info.tag_3: "baz", app_info.tag_2: "bar", app_info.tag_1: "bar"}
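For context, the writer side looks roughly like the sketch below (the bucket name and helper function are made up for illustration, not the actual job code). Each run flattens the nested app_info map into dotted column names and drops one new file into the day's partition:

# Rough sketch of the daily writer, assuming pandas + pyarrow (and s3fs for s3:// paths).
import uuid

import pandas as pd

def write_daily_events(events: list[dict], date: int, bucket: str = "my-bucket") -> None:
    # Flatten nested dicts into dotted column names, e.g.
    # {"app_info": {"tag_1": "foo"}} becomes the column "app_info.tag_1".
    df = pd.json_normalize(events)
    # Each run adds one new file inside the day's partition.
    path = f"s3://{bucket}/date={date}/{uuid.uuid4().hex}.parquet"
    df.to_parquet(path, engine="pyarrow", index=False)

write_daily_events(
    [{"name": "alice", "app_info": {"tag_1": "foo", "tag_2": "bar"}}],
    date=1,
)

The app_info keys vary from event to event, which is why the set of dotted columns differs between day 1 and day 2.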
The problem is my Glue Table schema looks like this after the first run on day 1
[
  {
    "Name": "name",
    "Type": "string"
  },
  {
    "Name": "app_info.tag_1",
    "Type": "string"
  },
  {
    "Name": "app_info.tag_2",
    "Type": "string"
  }
]
which is correct. But after the second run, a seemingly arbitrary set of fields from day 2 gets appended to the schema, like so:
[
  {
    "Name": "name",
    "Type": "string"
  },
  {
    "Name": "app_info.tag_1",
    "Type": "string"
  },
  {
    "Name": "app_info.tag_2",
    "Type": "string"
  },
  {
    "Name": "app_info.tag_1", // from the day-2 event; a duplicate of the
    "Type": "string"          // column already in the schema
  },
  {
    "Name": "app_info.tag_3", // from the day-2 event
    "Type": "string"
  }
]
The field app_info.tag_1 is added to my table again, even though it's already there. Is this the expected behaviour? If so, is there documentation for it? And how do I remove such duplicates?
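The only workaround I can think of so far is to dedupe the column list with boto3 after each crawl, something like the sketch below (database and table names are placeholders), but that feels like fighting the crawler rather than configuring it properly:

import boto3

glue = boto3.client("glue")

# Placeholder names for my actual database/table.
DATABASE, TABLE = "events_db", "events"

table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]

# Keep the first occurrence of each column name, drop the repeats.
seen = set()
deduped = []
for col in table["StorageDescriptor"]["Columns"]:
    if col["Name"] not in seen:
        seen.add(col["Name"])
        deduped.append(col)
table["StorageDescriptor"]["Columns"] = deduped

# update_table expects a TableInput, so strip the read-only fields
# that get_table returns before sending the definition back.
read_only = {
    "DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
    "IsRegisteredWithLakeFormation", "CatalogId", "VersionId",
}
table_input = {k: v for k, v in table.items() if k not in read_only}

glue.update_table(DatabaseName=DATABASE, TableInput=table_input)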
The schema for the app_info.tag fields is very dynamic, so they can be anything. I would like Glue to add any new fields to the schema without duplicating existing ones, since Athena complains about duplicate columns otherwise.
In the crawler settings I have selected "Create a single schema for each S3 path", as well as "Recrawl all" and "Update the table definition in the data catalog".
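For reference, my understanding is that those console options correspond to the following crawler settings in the boto3 API (crawler name is made up, and I may be mis-mapping the labels):

import json

import boto3

glue = boto3.client("glue")

glue.update_crawler(
    Name="daily-events-crawler",  # placeholder crawler name
    # "Create a single schema for each S3 path"
    Configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }),
    # "Recrawl all"
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVERYTHING"},
    # "Update the table definition in the data catalog"
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",  # illustrative, not part of the question
    },
)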
TIA