I've recently been working on a project which involves crawling data in Amazon S3 using the Glue managed service. This successfully creates a metastore which I can see on the Glue console.
There are mistakes in the schema which I've manually resolved through the console, e.g. STRING --> TIMESTAMP, BIGINT --> STRING etc. A partition is automatically created seemingly from the key in the S3 path. i.e. my key for objects in s3 is something like:
s3://mybucket/YYYMM/object.csv
Glue successfully partitions the data by the YYYMM (e.g. 201711) part of the key.
When I run a query I get the following error
HIVE_PARTITION_SCHEMA_MISMATCH
Which suggests although the table schema has been updated, the partition schema has not
Looking in the docs I find... https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#schema-syncing
Specifically
There are a few ways to fix this issue. First, if the data was accidentally added, you can remove the data files that cause the difference in schema, drop the partition, and re-crawl the data. Second, you can drop the individual partition and then run MSCK REPAIR within Athena to re-create the partition using the table's schema. This second option works only if you are confident that the schema applied will continue to read the data correctly.
So I've tried the second option and run.
ALTER TABLE mydb.mytable DROP PARTITION (partition_0=201711), PARTITION (partition_0=201712)
MSCK REPAIR TABLE mydb.mytable
Dropping the partitions appears to be successful, but running the repair tables yields
Partitions not in metastore: mytable:201711 mytable:201712
And I can't get any data back. Readding the partitions manually doesn't seem to work either.
e.g.
ALTER TABLE mydb.mytable ADD
PARTITION (partition_0=201711) LOCATION 's3://bucket/201711',
Gives the error line 2:2: missing 'column' at 'partition' (service: amazonathena; status code: 400; error code: invalidrequestexception;
Any help would be appreciated