
I can download a single snappy.parquet partition file with:

aws s3 cp s3://bucket/my-data.parquet/my-data-0000.snappy.parquet ./my-data-0000.snappy.parquet

And then use:

parquet-tools head my-data-0000.snappy.parquet
parquet-tools schema my-data-0000.snappy.parquet
parquet-tools meta my-data-0000.snappy.parquet

But I'd rather not download the file, and I'd rather not have to specify a particular snappy.parquet file. Instead, I'd like to just give the prefix: s3://bucket/my-data.parquet

Also, what if the schema differs across row groups or across different partition files?

Following instructions here, I downloaded a jar file and ran:

hadoop jar parquet-tools-1.9.0.jar schema s3://bucket/my-data.parquet/

But this resulted in the error: No FileSystem for scheme "s3".

This answer seems promising, but only for reading from HDFS. Any solution for S3?

Wassadamo

2 Answers


I wrote the tool clidb to help with this kind of "quick peek at a parquet file in S3" task.

You should be able to do:

pip install "clidb[extras]"
clidb s3://bucket/

and then click on a parquet file to load it as a view, which you can then inspect and run SQL against.


You can use this AWS CLI command; unlike S3 Select in the AWS Console, it works for files larger than 128 MB. You do need to specify the file directly, though. For different schemas across row groups you will need a more robust solution, but to me that is out of scope for a "quick peek".

aws s3api select-object-content \
--bucket bucket \
--key "my-data.parquet/my-data-0000.snappy.parquet" \
--expression "select * from s3object limit 100" \
--expression-type 'SQL' \
--input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' \
--output-serialization '{"JSON": {}}' "output.json"

The command will create an output.json file containing the results.
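
If you want to avoid naming a specific file, one option is to list the keys under the prefix and peek at each one in turn. A minimal sketch, assuming the bucket and prefix names from the question, that every key under the prefix is a Parquet file, and that writing one JSON file per key is acceptable:

# List every key under the dataset prefix, then pull a few rows from each file.
# "bucket" and "my-data.parquet/" are placeholders taken from the question.
for key in $(aws s3api list-objects-v2 \
    --bucket bucket \
    --prefix "my-data.parquet/" \
    --query 'Contents[].Key' \
    --output text); do
  aws s3api select-object-content \
    --bucket bucket \
    --key "$key" \
    --expression "select * from s3object limit 10" \
    --expression-type 'SQL' \
    --input-serialization '{"Parquet": {}, "CompressionType": "NONE"}' \
    --output-serialization '{"JSON": {}}' "$(basename "$key").json"
done

This still issues one select-object-content call per partition file, so it only sidesteps typing the file name; it does not merge or reconcile schemas across files.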