5

I'm getting a schema mismatch error when querying parquet data from Athena.

The error is:

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://datalake/googleanalytics/version=0/eventDate=2017-06-11/part-00001-9c9312f7-f9a5-47c3-987e-9348b78aaebc-c000.snappy.parquet (offset=0, length=48653579): Schema mismatch, metastore schema for row column totals has 13 fields but parquet schema has 12 fields

In the AWS Glue Crawler I tried enabling Update all new and existing partitions with metadata from the table which I thought would resolve this issue, however I'm still getting the above error. I did this because of the similar question: How to create AWS Glue table where partitions have different columns? ('HIVE_PARTITION_SCHEMA_MISMATCH')

The table schema for the totals column is:

struct<visits:bigint,hits:bigint,pageviews:bigint,timeOnSite:bigint,bounces:bigint,transactions:bigint,transactionRevenue:bigint,newVisits:bigint,screenviews:bigint,uniqueScreenviews:bigint,timeOnScreen:bigint,totalTransactionRevenue:bigint,sessionQualityDim:bigint>

The parquet file for partition eventDate=2017-06-11 is missing the last field "sessionQualityDim".

AndrewSC
  • 53
  • 1
  • 5
  • is your parquet data partitioned based on eventDate column? If yes then that would not be a part of schema in athena table. Instead, while creating the table pass eventDate column in partition by parameter. – Harsh Bafna May 21 '19 at 15:48
  • @HarshBafna Yes it is partitioned on eventDate. The table is created by a Glue crawler. The DDL for the table contains `PARTITIONED BY ('eventdate' string)` – AndrewSC May 21 '19 at 15:52
  • looks like s3://datalake/googleanalytics/version=0/eventDate=2017-06-11/part-00001-9c9312f7-f9a5-47c3-987e-9348b78aaebc-c000.snappy.parquet file has only 12 columns in its schema as compared to 13 defined in the athena table. Use https://github.com/apache/parquet-mr/tree/master/parquet-tools to find the schema of that file. – Harsh Bafna May 21 '19 at 16:32
  • please provide the output of "parquet-tools schema " command. – Harsh Bafna May 21 '19 at 16:33
  • @HarshBafna I don't have parquet-tools built at the moment. I have however used a Spark `DataFrame.printSchema()` to display the schema: – AndrewSC May 21 '19 at 17:21
  • 2017-06-11: `|-- totals: struct (nullable = true) |-- visits: long (nullable = true) |-- hits: long (nullable = true) |-- pageviews: long (nullable = true) |-- timeOnSite: long (nullable = true) |-- bounces: long (nullable = true) |-- transactions: long (nullable = true) |-- transactionRevenue: long (nullable = true) |-- newVisits: long (nullable = true) |-- screenviews: long (nullable = true) |-- uniqueScreenviews: long (nullable = true) |-- timeOnScreen: long (nullable = true) |-- totalTransactionRevenue: long (nullable = true)` – AndrewSC May 21 '19 at 17:22
  • now:`|-- totals: struct (nullable = true) |-- visits: long (nullable = true) |-- hits: long (nullable = true) |-- pageviews: long (nullable = true) |-- timeOnSite: long (nullable = true) |-- bounces: long (nullable = true) |-- transactions: long (nullable = true) |-- transactionRevenue: long (nullable = true) |-- newVisits: long (nullable = true) |-- screenviews: long (nullable = true) |-- uniqueScreenviews: long (nullable = true) |-- timeOnScreen: long (nullable = true) |-- totalTransactionRevenue: long (nullable = true) |-- sessionQualityDim: long (nullable = true)` – AndrewSC May 21 '19 at 17:23
  • The `sessionQualityDim` is not in the 2017 schema. The 2017 schema is a subset of the table schema. I expected the *Update all new and existing partitions with metadata from the table* setting to work. – AndrewSC May 21 '19 at 17:25
  • In the table schema you shared in the original post, I can't see the totals column, which is a part of your dataframe, which I assume is created from the same parquet file mentioned in the question. – Harsh Bafna May 21 '19 at 17:25
  • Ok, the same athena table cannot point to parquet files with different schema. You will need two different athena tables to read the older version and newer version of data. – Harsh Bafna May 21 '19 at 17:28

1 Answers1

1

You have parquet files with two different schema and the Athena table schema matches with the newer one. You can do one of the following :

1) Create two different tables in athena, one pointing to data till 2017 and other pointing to data post 2017.

2) In case the older data is no more valid for current use case, then you can simply archive that data and remove the 2017 and older partitions from your current table.

Harsh Bafna
  • 2,094
  • 1
  • 11
  • 21
  • If the schema change was "Add column", can this be somehow worked around? – TechCrap Feb 12 '20 at 12:45
  • 1
    @TechCrap : The above solutions are workaround only. As Athena is not a tradition RDBMS but imposes schema on top of data lying in S3. The problem here is data with different schema, so the table can read either one of the data. the old one or the new one. – Harsh Bafna Feb 12 '20 at 12:50