I have a workflow that writes incremental Parquet files daily as events are generated in our system.
Every time it runs, it adds a new Parquet file to an S3 partition for that day, like so:
S3
|-> date=1
|   |-> XXX.parquet // {name: "alice", app_info.tag_1: "foo", app_info.tag_2: "bar"}
|
|-> date=2
    |-> YYY.parquet // {name: "alice", app_info.tag_3: "baz", app_info.tag_2: "bar", app_info.tag_1: "bar"}
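For context, the writer side looks roughly like the sketch below (the bucket name and helper function are made up for illustration, not the actual job code). Each run flattens the nested app_info map into dotted column names and drops one new file into the day's partition:

# Rough sketch of the daily writer, assuming pandas + pyarrow (and s3fs for s3:// paths).
import uuid

import pandas as pd

def write_daily_events(events: list[dict], date: int, bucket: str = "my-bucket") -> None:
    # Flatten nested dicts into dotted column names, e.g.
    # {"app_info": {"tag_1": "foo"}} becomes the column "app_info.tag_1".
    df = pd.json_normalize(events)
    # Each run adds one new file inside the day's partition.
    path = f"s3://{bucket}/date={date}/{uuid.uuid4().hex}.parquet"
    df.to_parquet(path, engine="pyarrow", index=False)

write_daily_events(
    [{"name": "alice", "app_info": {"tag_1": "foo", "tag_2": "bar"}}],
    date=1,
)

The app_info keys vary from event to event, which is why the set of dotted columns differs between day 1 and day 2.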
The problem is my Glue Table schema looks like this after the first run on day 1
[
  {
    "Name": "name",
    "Type": "string"
  },
  {
    "Name": "app_info.tag_1",
    "Type": "string"
  },
  {
    "Name": "app_info.tag_2",
    "Type": "string"
  }
]
which is correct. But after the second run, a seemingly arbitrary set of fields from day 2 gets appended to the schema, like so:
[
  {
    "Name": "name",
    "Type": "string"
  },
  {
    "Name": "app_info.tag_1",
    "Type": "string"
  },
  {
    "Name": "app_info.tag_2",
    "Type": "string"
  },
  {
    "Name": "app_info.tag_1", // from the day-2 event; a duplicate of the
    "Type": "string"          // column already in the schema
  },
  {
    "Name": "app_info.tag_3", // from the day-2 event
    "Type": "string"
  }
]
The field app_info.tag_1 is added to my table again, even though it's already there. Is this the expected behaviour? If so, is there documentation for it? And how do I remove such duplicates?
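The only workaround I can think of so far is to dedupe the column list with boto3 after each crawl, something like the sketch below (database and table names are placeholders), but that feels like fighting the crawler rather than configuring it properly:

import boto3

glue = boto3.client("glue")

# Placeholder names for my actual database/table.
DATABASE, TABLE = "events_db", "events"

table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]

# Keep the first occurrence of each column name, drop the repeats.
seen = set()
deduped = []
for col in table["StorageDescriptor"]["Columns"]:
    if col["Name"] not in seen:
        seen.add(col["Name"])
        deduped.append(col)
table["StorageDescriptor"]["Columns"] = deduped

# update_table expects a TableInput, so strip the read-only fields
# that get_table returns before sending the definition back.
read_only = {
    "DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
    "IsRegisteredWithLakeFormation", "CatalogId", "VersionId",
}
table_input = {k: v for k, v in table.items() if k not in read_only}

glue.update_table(DatabaseName=DATABASE, TableInput=table_input)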
The schema for the app_info.tag fields is very dynamic, so they can be anything. I would like Glue to add any new fields to the schema without duplicating existing ones, since Athena complains about duplicate columns otherwise.
In the crawler settings I have selected "Create a single schema for each S3 path", as well as "Recrawl all" and "Update the table definition in the data catalog".
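For reference, my understanding is that those console options correspond to the following crawler settings in the boto3 API (crawler name is made up, and I may be mis-mapping the labels):

import json

import boto3

glue = boto3.client("glue")

glue.update_crawler(
    Name="daily-events-crawler",  # placeholder crawler name
    # "Create a single schema for each S3 path"
    Configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }),
    # "Recrawl all"
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVERYTHING"},
    # "Update the table definition in the data catalog"
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",  # illustrative, not part of the question
    },
)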
TIA