My data consist of relatively small Avro records, written out as Parquet files (on average < 1 MB per file).
Up to now I have used my local filesystem to run some tests with Spark.
I partitioned the data using a hierarchy of directories.
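Concretely, what I do today is roughly the following (a minimal sketch; P1 and P2 are placeholder partition keys, F1 and F2 placeholder payload fields, and the paths are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partitioned-write")
  .master("local[*]")
  .getOrCreate()

// Records carry payload fields F1, F2; P1 and P2 are the partition keys
// (all placeholder names standing in for my real Avro schema).
val df = spark.read.parquet("/data/input")

// Hive-style layout: writes P1=<value>/P2=<value>/part-*.parquet
df.write
  .partitionBy("P1", "P2")
  .parquet("/data/partitioned")
```

Locally this works, but it leaves me with many small Parquet files.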
I wonder if it would be better to "build" the partitioning into the Avro records themselves and accumulate bigger files... However, I imagine that partitioned Parquet files would also "map" onto partitioned HDFS files.
What approach would be best?
Edit (clarifying based on comments):
"build the partitioning onto the Avro record": imagine that my directory structure is P1=/P2=/file.avro and that the Avro record contains fields F1 and F2. I could save all of that in a single Avro file containing the fields P1, P2, F1 and F2. Ie there is no need for a partitioning structure with directories as it is all present in the Avro records
About Parquet partitions and HDFS partitions: will HDFS split a big Parquet file across different machines, and would those splits correspond to distinct Parquet partitions? (I am not sure this clarifies my question; if it doesn't, that means I don't really understand it myself.)