3

I've recently been working on a project which involves crawling data in Amazon S3 using the Glue managed service. This successfully creates a metastore which I can see on the Glue console.

There are mistakes in the schema which I've manually resolved through the console, e.g. STRING --> TIMESTAMP, BIGINT --> STRING etc. A partition is automatically created seemingly from the key in the S3 path. i.e. my key for objects in s3 is something like:

s3://mybucket/YYYMM/object.csv

Glue successfully partitions the data by the YYYMM (e.g. 201711) part of the key.

When I run a query I get the following error HIVE_PARTITION_SCHEMA_MISMATCH Which suggests although the table schema has been updated, the partition schema has not

Looking in the docs I find... https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#schema-syncing

Specifically

There are a few ways to fix this issue. First, if the data was accidentally added, you can remove the data files that cause the difference in schema, drop the partition, and re-crawl the data. Second, you can drop the individual partition and then run MSCK REPAIR within Athena to re-create the partition using the table's schema. This second option works only if you are confident that the schema applied will continue to read the data correctly.

So I've tried the second option and run.

ALTER TABLE mydb.mytable DROP PARTITION (partition_0=201711), PARTITION (partition_0=201712)
MSCK REPAIR TABLE mydb.mytable

Dropping the partitions appears to be successful, but running the repair tables yields

Partitions not in metastore: mytable:201711 mytable:201712

And I can't get any data back. Readding the partitions manually doesn't seem to work either.

e.g.

ALTER TABLE mydb.mytable ADD
    PARTITION (partition_0=201711) LOCATION 's3://bucket/201711',

Gives the error line 2:2: missing 'column' at 'partition' (service: amazonathena; status code: 400; error code: invalidrequestexception;

Any help would be appreciated

John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
robarthur1
  • 457
  • 2
  • 9
  • 19

1 Answers1

5

In case anyone comes across this later, I found the answer to my problem in this question. https://stackoverflow.com/a/33895249/4537686

So changing the format of my key in my bucket from

s3://mybucket/YYYMM/object.csv to:

s3://mybucket/date=YYYMM/object.csv

and then running

ALTER TABLE mydb.mytable DROP PARTITION (partition_0=201711), PARTITION (partition_0=201712)
MSCK REPAIR TABLE mydb.mytable

re added my missing partitions.

robarthur1
  • 457
  • 2
  • 9
  • 19
  • When you do a MSCK repair table, it will list the missing file(s) to partition(s) in the Athena GUI. Do you know of a way to get a list of the missing files programmatically? – pitchblack408 Jul 21 '20 at 01:09
  • @pitchblack408 you can do "SHOW PARTITIONS" on the table + a list via S3 Cli in 2 differents files and do a diff between both files – tdebroc Dec 08 '20 at 19:24