
I have a workflow that creates incremental Parquet files daily as events are generated in our system.

Every time it runs, it adds a new Parquet file to an S3 partition for that day, like so:

S3
|->date=1
| |-> XXX.parquet // {name: "alice", app_info.tag_1: "foo", app_info.tag_2: "bar"}
|
|->date=2
  |-> YYY.parquet // {name: "alice", app_info.tag_3: "baz", app_info.tag_2: "bar", app_info.tag_1: "bar"}

The problem is that my Glue table schema looks like this after the first run on day 1:

[
  {
    "Name": "name",
    "Type": "string"
  },
  {
    "Name": "app_info.tag_1",
    "Type": "string"
  },
  {
    "Name": "app_info.tag_2",
    "Type": "string"
  }
]

which is correct, but after the second run a seemingly random set of fields from day 2 gets appended to the schema, duplicates included:

[
  {
    "Name": "name",
    "Type": "string"
  },
  {
    "Name": "app_info.tag_1",
    "Type": "string"
  },
  {
    "Name": "app_info.tag_2",
    "Type": "string"
  },
  {
    "Name": "app_info.tag_1", // from the day-2 event; duplicates the
    "Type": "string"          // existing app_info.tag_1 above
  },
  {
    "Name": "app_info.tag_3", // new field from the day-2 event
    "Type": "string"
  }
]

The schema field app_info.tag_1 is added to my table again, even though it's already there. Is this the expected behaviour? If so, is there documentation for it? Also, how do I remove such duplicates?
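For now I'm considering stripping the duplicates myself with a script like this (a rough sketch using boto3; `my_db` and `my_table` are placeholders for my actual database and table names):

```python
# Sketch: drop duplicate column names from a Glue table's StorageDescriptor,
# keeping the first occurrence of each name, then push the cleaned list back.

def dedupe_columns(columns):
    """Return the column list with later duplicates (by Name) removed."""
    seen = set()
    result = []
    for col in columns:
        if col["Name"] not in seen:
            seen.add(col["Name"])
            result.append(col)
    return result

def apply_dedupe(database="my_db", table_name="my_table"):
    """Fetch the table, dedupe its columns, and write it back.

    Only call this against real AWS credentials; my_db/my_table are
    placeholders.
    """
    import boto3  # only needed when actually talking to AWS

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    sd = table["StorageDescriptor"]
    sd["Columns"] = dedupe_columns(sd["Columns"])
    # update_table accepts only TableInput-shaped keys, so copy just those
    # over instead of passing the whole get_table response back.
    table_input = {
        "Name": table["Name"],
        "StorageDescriptor": sd,
        "PartitionKeys": table.get("PartitionKeys", []),
        "TableType": table.get("TableType", "EXTERNAL_TABLE"),
        "Parameters": table.get("Parameters", {}),
    }
    glue.update_table(DatabaseName=database, TableInput=table_input)
```

That feels like a workaround rather than a fix, though, which is why I'm asking whether the crawler is supposed to behave this way.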

The app_info.tag_* fields are very dynamic, so they can be anything, and I would like Glue to add any new fields to the schema without duplicating existing ones, since Athena then complains about duplicate columns.
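To be concrete, the merged schema I expect is just the union of the flattened field names across days, with each name appearing once (plain-Python sketch; the events are assumed here to be nested dicts that flatten to the dotted names shown above):

```python
# Sketch: flatten nested event dicts into the dotted column names Glue shows,
# then merge several days' events into one duplicate-free list of names.

def flatten(event, prefix=""):
    """Yield dotted column names for a (possibly nested) event dict."""
    for key, value in event.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, prefix=name + ".")
        else:
            yield name

def merged_schema(*events):
    """Union of column names across events, first-seen order, no duplicates."""
    seen = []
    for event in events:
        for name in flatten(event):
            if name not in seen:
                seen.append(name)
    return seen

day1 = {"name": "alice", "app_info": {"tag_1": "foo", "tag_2": "bar"}}
day2 = {"name": "alice", "app_info": {"tag_3": "baz", "tag_2": "bar", "tag_1": "bar"}}
```

So after day 2 I would expect exactly four columns: name, app_info.tag_1, app_info.tag_2 and app_info.tag_3.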

In the crawler settings I have selected "Create a single schema for each S3 path", recrawl all, and "Update the table definition in the data catalog".

TIA

Ryan W Kan