
My data consist of relatively small Avro records, written to Parquet files that average < 1 MB each.

Up to now I have used my local filesystem to run some tests with Spark.

I partitioned the data using a hierarchy of directories.

I wonder whether it would be better to "build" the partitioning onto the Avro record and accumulate bigger files. However, I imagine that partitioned Parquet files would "map" onto HDFS partitioned files too.

What approach would be best?

Edit (clarifying based on comments):

  • "build the partitioning onto the Avro record": imagine that my directory structure is P1=/P2=/file.avro and that the Avro record contains fields F1 and F2. I could save all of that in a single Avro file containing the fields P1, P2, F1 and F2. Ie there is no need for a partitioning structure with directories as it is all present in the Avro records

  • About Parquet partitions and HDFS partitions: will HDFS split a big Parquet file across different machines, and will that correspond to distinct Parquet partitions? (I don't know if that clarifies my question; if not, it means I don't really understand.)
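To make the two layouts concrete, here is a minimal Spark (Scala) sketch; the paths and session setup are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("layout-sketch").getOrCreate()

// Records with columns P1, P2, F1, F2 (hypothetical input path).
val df = spark.read.parquet("/data/input")

// Layout 1: directory-based partitioning, producing P1=.../P2=.../part-*.parquet
df.write.partitionBy("P1", "P2").parquet("/data/partitioned")

// Layout 2: keep P1 and P2 as ordinary columns inside fewer, larger files.
df.write.parquet("/data/flat")
```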

  • I could not understand what you are asking exactly (what do you mean by "build the partitioning onto the Avro record" and "partitioned Parquet files would map onto HDFS partitioned files"?), but I'll try to answer. Generally speaking, from my experience it is always better to work with larger files (I usually use sizes of 100 MB-1 GB per file). Also, when partitioning you should avoid creating folders with a small amount of data. If you want to create larger Parquet files, use coalesce(). Finally, when reading from HDFS, Parquet will match partitions to input files, so if that is what you asked, then yes. – Tal Joffe Aug 23 '16 at 11:37
  • @TalJoffe Thanks for your answer, I'll think about it. I clarified my questions, is that better? I did not know about `coalesce()`, that's probably where I should look. – Cedric H. Aug 23 '16 at 12:18
  • 1
    o.k. great. I saw your edit so I put an answer to try and help.let me know if that answered your question – Tal Joffe Aug 23 '16 at 12:38

1 Answer


The main reasoning behind partitioning at the folder level is that when Spark, for instance, reads the data and there is a filter on the partitioned column (extracted from the folder name, as long as the format is path/partitionName=value), it will read only the needed folders instead of reading everything and then applying the filter. So if you want to use this mechanism, use a hierarchy in your folder structure (I use it often).
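A hedged sketch of that pruning in action, reusing the made-up layout from the question; `explain()` is only there to show the partition filters in the plan:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pruning-sketch").getOrCreate()
import spark.implicits._

// Directory layout assumed: /data/partitioned/P1=.../P2=.../part-*.parquet
// The filter on the partition column lets Spark read only the P1=a folders.
val onlyA = spark.read.parquet("/data/partitioned").filter($"P1" === "a")
onlyA.explain()  // the plan's PartitionFilters entry reflects the pruning
```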

Generally speaking, I would recommend avoiding many folders with little data in them (not sure if that is the case here).
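As mentioned in the comments, `coalesce()` is one way to end up with fewer, larger files. A sketch, where the target of 4 partitions is an arbitrary placeholder:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coalesce-sketch").getOrCreate()

// Collapse many small input files into a few larger Parquet files.
// 4 is a placeholder; pick a count that puts each output file in the
// 100 MB-1 GB range mentioned in the comments.
val df = spark.read.parquet("/data/many-small-files")
df.coalesce(4).write.parquet("/data/fewer-larger-files")
```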

About Spark input partitioning (same word, different meaning): when reading from HDFS, Spark will try to read files so that its partitions match the files on HDFS (to prevent shuffling), so if the data is partitioned on HDFS, Spark will match those same partitions. To my knowledge, HDFS does not partition files; rather, it replicates them (to increase reliability). So I think a single large Parquet file will translate to a single file on HDFS, which will be read into a single partition unless you repartition it or define the number of partitions when reading (there are several ways to do that, depending on the Spark version).
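Two of those ways, sketched under the assumption of Spark 2.x (paths and numbers are placeholders): `repartition` after reading, or lowering the maximum bytes per input partition so one large file is split across more read tasks.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("split-sketch").getOrCreate()

// Option 1: repartition after reading (this triggers a shuffle).
val df = spark.read.parquet("/data/one-big-file")
val spread = df.repartition(16)  // 16 is a placeholder partition count

// Option 2 (Spark 2.x): cap the bytes per input partition so a single
// large file is read by several tasks instead of one.
spark.conf.set("spark.sql.files.maxPartitionBytes", (32 * 1024 * 1024).toString)
val finer = spark.read.parquet("/data/one-big-file")
```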

– Tal Joffe