I get conflicting answers when searching and reading about this subject online. Can anyone share their experience? I know for a fact that a gzipped CSV is not splittable, but maybe the internal structure of Parquet files makes this a completely different case for Parquet than for CSV?
1 Answer
Parquet files with GZIP compression are actually splittable. This is because of the internal layout of Parquet files: they are always splittable, independent of the compression algorithm used.
This is mainly due to the design of Parquet files, which are divided into the following parts:
- Each Parquet file consists of several RowGroups; these should be roughly the same size as your HDFS block size.
- Each RowGroup consists of a ColumnChunk per column. Each ColumnChunk in a RowGroup has the same number of Rows.
- ColumnChunks are split into Pages, which are typically between 64 KiB and 16 MiB in size. Compression is done on a per-page basis, so a page is the lowest level of parallelisation a job can work on.
You can find a more detailed explanation here: https://github.com/apache/parquet-format#file-format
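For illustration, here is a minimal sketch (assuming PyArrow; the file name and the size values are my own choices, not something stated in the answer) that writes a GZIP-compressed Parquet file and then inspects the RowGroup/ColumnChunk layout described above:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative in-memory table.
table = pa.table({
    "id": list(range(100_000)),
    "value": [i * 0.5 for i in range(100_000)],
})

# Compression is applied inside the file, page by page, so the result stays
# splittable even with GZIP.
pq.write_table(
    table,
    "data.gz.parquet",          # hypothetical output path
    compression="gzip",
    row_group_size=25_000,      # rows per RowGroup (illustrative value)
    data_page_size=64 * 1024,   # target page size in bytes (~64 KiB)
)

# Each RowGroup/ColumnChunk records its own codec in the file metadata.
meta = pq.ParquetFile("data.gz.parquet").metadata
print("row groups:", meta.num_row_groups)
print("codec of first column chunk:", meta.row_group(0).column(0).compression)
```

Because the codec is recorded per ColumnChunk and compression is applied per page, a reader can seek to an individual RowGroup and decode it without touching the rest of the file, which is what makes the split possible.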

Uwe L. Korn
Thanks for your answer. Just want to confirm: these will technically be `.gz.parquet` files and not `.parquet.gz` files, correct? It's just that products like Microsoft PolyBase produce `.gz` files when exporting data externally in Parquet format, and I have not yet verified whether it is the file itself that is compressed or the file's internal chunks. – YuGagarin Apr 13 '17 at 14:05
Yes, they should be `gz.parquet`. The compression should be done inside Parquet by the Parquet implementation. If you have a tool that first generates Parquet files and then runs GZIP on them, those are actually invalid Parquet files. For Parquet it is essential that some parts of the format are not compressed (e.g. the header). These parts are tiny (often around one or two KiB), but compressing them would lead to a significant performance loss. – Uwe L. Korn Apr 13 '17 at 16:18
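As a quick sanity check (a minimal sketch; the `classify` helper and the path are hypothetical, not from the comment thread), you can distinguish the two cases by their magic bytes: a valid Parquet file starts and ends with the uncompressed bytes `PAR1`, while a file run through external gzip starts with the gzip magic bytes `\x1f\x8b`:

```python
def classify(path: str) -> str:
    """Rough check: internal Parquet compression vs. externally gzipped file."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)   # last four bytes of the file
        tail = f.read(4)
    if head == b"PAR1" and tail == b"PAR1":
        return "valid Parquet (compression, if any, is internal)"
    if head[:2] == b"\x1f\x8b":
        return "externally gzipped -- not a valid Parquet file"
    return "unknown format"

print(classify("data.gz.parquet"))  # hypothetical path from the sketch above
```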