I'm trying to query data from S3 using AWS Athena, where the data is stored in Parquet format. Specifically, I am trying to create a nested schema that stores rows of a complex object, generated using the parquetjs library. Here is an example of how I am generating the data:

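const parquet = require('parquetjs');

// Nested schema: `body` is a repeated group with a single string field.
const schema = {
  id: {type: 'UTF8'},
  body: {
    repeated: true,
    fields: {
      text: {type: 'UTF8'},
    },
  },
};
const obj = {
  id: '123',
  body: [
    {text: 'Hello'},
    {text: 'world!'},
  ],
};

async function writeFile(fileName) {
  const parquetSchema = new parquet.ParquetSchema(schema);
  const writer = await parquet.ParquetWriter.openFile(parquetSchema, fileName);
  await writer.appendRow(obj); // write the single example row
  await writer.close(); // flush the footer so the file is complete
}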

In AWS Athena, I have created an external table with the following structure:

CREATE EXTERNAL TABLE `tabletest`(
  `id` string,
  `body` array<struct<text:string>>
)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://xyz/parquet_test'
TBLPROPERTIES (
  'classification'='parquet', 
  'transient_lastDdlTime'='1679107188')

However, when I try to query the data using SELECT * FROM tabletest, I get the following error:

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://xyz/parquet_test/d1710fd7dde563dc9e0348211825e726 (offset=0, length=272): org.apache.parquet.io.PrimitiveColumnIO cannot be cast to org.apache.parquet.io.GroupColumnIO

I'm not sure what is causing this error or how to resolve it. Any suggestions or insights would be greatly appreciated.
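
One way to reason about the error is to write out the Parquet message type that the schema definition above presumably corresponds to. The sketch below is only an illustration: `describeSchema` and `describeFields` are hypothetical helpers, not part of parquetjs, and they assume that parquetjs maps `repeated: true` to a plain `repeated` field with no LIST annotation and treats every other field as `required` unless it is marked `optional`. Whether that layout is what triggers the cast error in Athena is a guess, not something confirmed here.

```javascript
// Hypothetical helper (not a parquetjs API): renders a parquetjs-style
// schema definition as Parquet message-type text.
// ASSUMPTION: `repeated: true` becomes a plain `repeated` field without a
// LIST annotation, and fields are `required` unless marked `optional`.
function describeFields(fields, indent = '  ') {
  const lines = [];
  for (const [name, opts] of Object.entries(fields)) {
    const repetition = opts.repeated
      ? 'repeated'
      : opts.optional
        ? 'optional'
        : 'required';
    if (opts.fields) {
      // Nested definition: emit a group and recurse into its children.
      lines.push(`${indent}${repetition} group ${name} {`);
      lines.push(...describeFields(opts.fields, indent + '  '));
      lines.push(`${indent}}`);
    } else {
      // Only UTF8 is mapped explicitly; other types are lowercased as-is.
      const isString = opts.type === 'UTF8';
      const physical = isString ? 'binary' : String(opts.type).toLowerCase();
      const annotation = isString ? ' (UTF8)' : '';
      lines.push(`${indent}${repetition} ${physical} ${name}${annotation};`);
    }
  }
  return lines;
}

function describeSchema(fields) {
  return ['message root {', ...describeFields(fields), '}'].join('\n');
}

// Same schema definition as in the question.
const schema = {
  id: {type: 'UTF8'},
  body: {
    repeated: true,
    fields: {
      text: {type: 'UTF8'},
    },
  },
};

console.log(describeSchema(schema));
```

Under those assumptions the sketch prints a `repeated group body { ... }` directly under the root. The Parquet format specification's standard LIST layout is different: an outer `group body (LIST)` containing a `repeated group list` that in turn holds an `element` field. Comparing the two shapes against the actual file footer may help narrow down whether the `array<struct<text:string>>` column definition and the file disagree about nesting.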

Caesar
  • Do you have the sample data somewhere? – Prabhakar Reddy Mar 20 '23 at 11:06
  • The most likely explanation to me is that the files created by parquetjs aren't well formed, or are of a very old format version. parquetjs is very old and hasn't been maintained for years, and I don't know how compatible its files were even when it was maintained. You may have better luck with https://github.com/kylebarron/parquet-wasm – Theo Mar 20 '23 at 19:30
  • I'm having the same issue when using a struct rather than an array. – Nicholas Porter Mar 26 '23 at 05:34
