We write Parquet files to S3 and then use Athena to query that data. We use the parquet-protobuf library to convert proto messages into Parquet records. We recently added a repeated field to our proto message definition and expected to be able to fetch/query it through Athena.
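For context, the writer is created roughly like this (a minimal sketch of our setup; the path and record variable are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.proto.ProtoParquetWriter;

// Minimal sketch: write one LedgerRecord to a snappy-compressed Parquet file.
try (ProtoParquetWriter<LedgerRecord> writer = new ProtoParquetWriter<>(
        new Path("s3a://our-bucket/ledger/part-00000.snappy.parquet"), // placeholder path
        LedgerRecord.class,
        CompressionCodecName.SNAPPY,
        ParquetWriter.DEFAULT_BLOCK_SIZE,
        ParquetWriter.DEFAULT_PAGE_SIZE)) {
    writer.write(ledgerRecord); // ledgerRecord: a built LedgerRecord message
}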
message LedgerRecord {
    ....
    repeated uint64 shared_account_ids = 168; // newly added field
}
Resulting Parquet schema, as seen with the parquet-tools inspect command:
############ Column(shared_account_ids) ############
name: shared_account_ids
path: shared_account_ids
max_definition_level: 1
max_repetition_level: 1
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
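If I understand the Parquet LIST spec correctly, for Hive/Presto/Athena to read the column as an array the file would need the standard 3-level annotated layout, something like (my reading of the spec, not output from our files):

optional group shared_account_ids (LIST) {
    repeated group list {
        optional int64 element;
    }
}

whereas our files contain a bare repeated int64 at the top level, which readers treat as a primitive column.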
Athena table alteration:
ALTER TABLE order_transactions_with_projections
ADD COLUMNS (shared_account_ids array<bigint>);
Simple query that fails:

WITH T1 AS (
    SELECT DISTINCT record_id, msg_type, rec, time_sent, shared_account_ids
    FROM ledger_int_dev.order_transactions_with_projections
    WHERE environment = 'int-dev-cert'
      AND (epoch_day >= 19082 AND epoch_day <= 19172)
      AND company = '12'
)
SELECT * FROM T1 ORDER BY time_sent DESC LIMIT 100
Error: HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://<file_name>.snappy.parquet (offset=0, length=48634): org.apache.parquet.io.PrimitiveColumnIO cannot be cast to org.apache.parquet.io.GroupColumnIO
My reading of the error is that the table now declares shared_account_ids as a group (array) type while the Parquet files store it as a plain repeated primitive, hence the PrimitiveColumnIO-to-GroupColumnIO cast failure.

How do I convert a protobuf message with a repeated field into Parquet so that Athena understands the field as an ARRAY rather than a primitive type? Is an intermediate conversion to Avro necessary, as described in https://github.com/rdblue/parquet-avro-protobuf/blob/master/README.md, or can the parquet-mr library be used directly?
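One lead I found: parquet-mr 1.11.0+ appears to add a writeSpecsCompliant option to ProtoWriteSupport (PARQUET-968) that writes repeated fields as spec-compliant 3-level LIST groups. A sketch of how I would expect it to be used, assuming the option does what I think (I have not verified this end-to-end; the path and record variable are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.proto.ProtoWriteSupport;

Configuration conf = new Configuration();
// Ask parquet-protobuf to emit repeated fields as the standard
// 3-level LIST structure instead of a bare repeated primitive.
ProtoWriteSupport.setWriteSpecsCompliant(conf, true);

try (ParquetWriter<LedgerRecord> writer = new ParquetWriter<>(
        new Path("s3a://our-bucket/ledger/part-00000.snappy.parquet"), // placeholder path
        conf,
        new ProtoWriteSupport<>(LedgerRecord.class))) {
    writer.write(ledgerRecord); // ledgerRecord: a built LedgerRecord message
}

If that is the intended route, does it remove the need for the Avro detour?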