We write Parquet files to S3 and then use Athena to query that data. We use the parquet-protobuf library to convert proto messages into Parquet records. We recently added a repeated field to our proto message definition and expected to be able to fetch/query it through Athena.
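For context, the writer is created roughly like this (a minimal sketch of our setup; the path and record variable are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.proto.ProtoParquetWriter;

// Minimal sketch: write one LedgerRecord to a snappy-compressed Parquet file.
try (ProtoParquetWriter<LedgerRecord> writer = new ProtoParquetWriter<>(
        new Path("s3a://our-bucket/ledger/part-00000.snappy.parquet"), // placeholder path
        LedgerRecord.class,
        CompressionCodecName.SNAPPY,
        ParquetWriter.DEFAULT_BLOCK_SIZE,
        ParquetWriter.DEFAULT_PAGE_SIZE)) {
    writer.write(ledgerRecord); // ledgerRecord: a built LedgerRecord message
}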
message LedgerRecord {
    ....
    repeated uint64 shared_account_ids = 168; // newly added field
}
Resulting Parquet schema, as seen with the parquet-tools inspect command:
############ Column(shared_account_ids) ############
name: shared_account_ids
path: shared_account_ids
max_definition_level: 1
max_repetition_level: 1
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
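If I understand the Parquet LIST spec correctly, for Hive/Presto/Athena to read the column as an array the file would need the standard 3-level annotated layout, something like (my reading of the spec, not output from our files):

optional group shared_account_ids (LIST) {
    repeated group list {
        optional int64 element;
    }
}

whereas our files contain a bare repeated int64 at the top level, which readers treat as a primitive column.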
Athena table alteration:
ALTER TABLE order_transactions_with_projections
ADD COLUMNS (shared_account_ids array<bigint>);
Simple query that fails:

WITH T1 AS (
    SELECT DISTINCT record_id, msg_type, rec, time_sent, shared_account_ids
    FROM ledger_int_dev.order_transactions_with_projections
    WHERE environment = 'int-dev-cert'
      AND (epoch_day >= 19082 AND epoch_day <= 19172)
      AND company = '12'
)
SELECT * FROM T1 ORDER BY time_sent DESC LIMIT 100
Error: HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://<file_name>.snappy.parquet (offset=0, length=48634): org.apache.parquet.io.PrimitiveColumnIO cannot be cast to org.apache.parquet.io.GroupColumnIO
My reading of the error is that the table now declares shared_account_ids as a group (array) type while the Parquet files store it as a plain repeated primitive, hence the PrimitiveColumnIO-to-GroupColumnIO cast failure.

How do I convert a protobuf message with a repeated field into Parquet so that Athena understands the field as an ARRAY rather than a primitive type? Is an intermediate conversion to Avro necessary, as described in https://github.com/rdblue/parquet-avro-protobuf/blob/master/README.md, or can the parquet-mr library be used directly?
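One lead I found: parquet-mr 1.11.0+ appears to add a writeSpecsCompliant option to ProtoWriteSupport (PARQUET-968) that writes repeated fields as spec-compliant 3-level LIST groups. A sketch of how I would expect it to be used, assuming the option does what I think (I have not verified this end-to-end; the path and record variable are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.proto.ProtoWriteSupport;

Configuration conf = new Configuration();
// Ask parquet-protobuf to emit repeated fields as the standard
// 3-level LIST structure instead of a bare repeated primitive.
ProtoWriteSupport.setWriteSpecsCompliant(conf, true);

try (ParquetWriter<LedgerRecord> writer = new ParquetWriter<>(
        new Path("s3a://our-bucket/ledger/part-00000.snappy.parquet"), // placeholder path
        conf,
        new ProtoWriteSupport<>(LedgerRecord.class))) {
    writer.write(ledgerRecord); // ledgerRecord: a built LedgerRecord message
}

If that is the intended route, does it remove the need for the Avro detour?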