We have hundreds of HDFS partitions that we write to every hour of the day. The partitions are per day to make loading into Hive straightforward, and the data is written in Parquet format.
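For context, the hourly write looks roughly like this (a minimal sketch, not our actual job; the input source, the output path, and the `day` partition column are all placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hourly-write").getOrCreate()

# Hypothetical hourly input; in reality this comes from our ingest pipeline.
df = spark.read.json("/ingest/events/current_hour")

# Append this hour's records into the day-level partition so Hive can load
# the data with a simple daily partitioning scheme.
(df.write
   .mode("append")
   .partitionBy("day")
   .parquet("/warehouse/events"))
```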
The issue we run into is that, because we want the data to be queryable as quickly as possible, the hourly writes result in lots of small files.
There are plenty of examples, such as How to combine small parquet files to one large parquet file?, covering the combining code itself; my question is: how do you avoid breaking people's active queries while swapping the newly compacted files in for the small ones?
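For reference, the compaction step we have in mind is essentially the following (again a sketch; the paths, the specific day, and the target file count are assumptions). The part this question is about is what to do after the compacted output exists, i.e. how to substitute it for the small files without disturbing queries that are already running against them.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-day").getOrCreate()

day_path = "/warehouse/events/day=2017-06-01"       # hypothetical daily partition
tmp_path = "/warehouse/_compaction/day=2017-06-01"  # staging area for compacted output

# Read all of the small hourly files for the day and rewrite them as a
# handful of larger Parquet files in a temporary location.
(spark.read.parquet(day_path)
      .coalesce(4)           # target file count chosen for illustration
      .write
      .mode("overwrite")
      .parquet(tmp_path))
```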