
We write parquet files to S3 and then use Athena to query that data. We use the "parquet-protobuf" library to convert proto messages into parquet records. We recently added a repeated field to our proto message definition and expected to be able to fetch/query it using Athena.

```proto
message LedgerRecord {
  ....
  repeated uint64 shared_account_ids = 168; // newly added field
}
```

Resulting parquet schema, as seen by the `parquet-tools inspect` command:

```
############ Column(shared_account_ids) ############
name: shared_account_ids
path: shared_account_ids
max_definition_level: 1
max_repetition_level: 1
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
```
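For comparison, the Parquet format's list specification expects an array column to be a three-level group annotated with the `LIST` logical type. Roughly (a sketch of the spec-compliant layout, not output from my files):

```
optional group shared_account_ids (LIST) {
  repeated group list {
    optional int64 element;
  }
}
```

The inspect output above instead shows a bare repeated `INT64` with no logical type, which Athena's reader treats as a primitive column. That mismatch is what produces the `PrimitiveColumnIO cannot be cast to GroupColumnIO` error below when the table declares the column as an array.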

Athena table alteration:

```sql
ALTER TABLE order_transactions_with_projections ADD COLUMNS (shared_account_ids array<bigint>);
```

Simple query that fails:

```sql
WITH T1 AS (
  SELECT DISTINCT record_id, msg_type, rec, time_sent, shared_account_ids
  FROM ledger_int_dev.order_transactions_with_projections
  WHERE environment = 'int-dev-cert'
    AND (epoch_day >= 19082 AND epoch_day <= 19172)
    AND company = '12'
)
SELECT * FROM T1 ORDER BY time_sent DESC LIMIT 100
```

Error:

```
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://<file_name>.snappy.parquet (offset=0, length=48634):
org.apache.parquet.io.PrimitiveColumnIO cannot be cast to org.apache.parquet.io.GroupColumnIO
```

How do I convert a protobuf message with a `repeated` field into parquet so that Athena understands the field as an ARRAY instead of a primitive type? Is an intermediate conversion to Avro necessary, as mentioned in https://github.com/rdblue/parquet-avro-protobuf/blob/master/README.md, or can the parquet-mr library be used directly?
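One direction worth trying, sketched below under the assumption of a recent parquet-protobuf (1.11+), where `ProtoWriteSupport.setWriteSpecsCompliant` and the `ProtoParquetWriter.builder` API exist: that flag asks the writer to emit repeated fields as spec-compliant three-level `LIST` groups rather than bare repeated primitives. The output path and the single sample record here are hypothetical placeholders; `LedgerRecord` is the generated message class from the proto above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.proto.ProtoParquetWriter;
import org.apache.parquet.proto.ProtoWriteSupport;

public class LedgerWriterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask parquet-protobuf to write repeated fields as spec-compliant
        // 3-level LIST groups instead of flat repeated primitives.
        ProtoWriteSupport.setWriteSpecsCompliant(conf, true);

        // "output.snappy.parquet" is a placeholder path for illustration.
        try (ParquetWriter<LedgerRecord> writer =
                ProtoParquetWriter.<LedgerRecord>builder(new Path("output.snappy.parquet"))
                        .withMessage(LedgerRecord.class)
                        .withCompressionCodec(CompressionCodecName.SNAPPY)
                        .withConf(conf)
                        .build()) {
            writer.write(LedgerRecord.newBuilder()
                    .addSharedAccountIds(42L) // sample value, not real data
                    .build());
        }
    }
}
```

If this works, the resulting files should show `shared_account_ids` as a group with a `LIST` logical type in `parquet-tools inspect`, which is the shape the Athena `array<bigint>` column expects. I have not verified this against my exact pipeline, hence the question.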

user2903819