
We have hundreds of HDFS partitions that we write to each hour of the day. The partitions are per-day to make loading into Hive straightforward, and the data is written in Parquet format.

The issue we run into is that, because we want the data to be queryable as fast as possible, the hourly writing produces lots of small files.

There are plenty of examples, such as "How to combine small parquet files to one large parquet file?", covering the combining code itself. My question is: how do you avoid breaking people's active queries while moving the newly compacted files into place and retiring the small ones?

Mad Dog
  • You could use Hive Streaming w/ ORC files which runs its own compaction processor – OneCricketeer Aug 31 '18 at 02:07
  • The data is written from Spark jobs originally. – Mad Dog Aug 31 '18 at 02:45
  • Okay... https://github.com/jerryshao/spark-hive-streaming-sink – OneCricketeer Aug 31 '18 at 02:46
  • In any case, as you've found, HDFS does not like small files, and you cannot offer queries as quick as possible with just HDFS and a Hive table over it... The recommended pattern is generally putting into an actual database such as HBase, then export chunks of it for longer-term storage with Hive – OneCricketeer Aug 31 '18 at 02:49

1 Answer


The metastore has a filesystem location for each partition. This location is often based on the table and partition:

hdfs://namenode/data/web/request_logs/ds=2018-05-03

However, the location can be completely arbitrary, so you can utilize this to implement snapshot isolation or versioning. When you compact the files in the partition, write the new files into a new location:

hdfs://namenode/data/web/request_logs/v2_ds=2018-05-03
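As a sketch of the compaction step itself (the asker mentions the data originates from Spark jobs), the small files can be read and rewritten into the new location in one job. The table paths and the output file count below are illustrative assumptions, not part of the original answer:

```python
# PySpark sketch: read the partition's many small Parquet files and
# rewrite them as a handful of larger files into a *new* directory,
# leaving the original files untouched for in-flight queries.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-partition").getOrCreate()

src = "hdfs://namenode/data/web/request_logs/ds=2018-05-03"
dst = "hdfs://namenode/data/web/request_logs/v2_ds=2018-05-03"

df = spark.read.parquet(src)
# coalesce to a small number of output files sized to suit the HDFS block size
df.coalesce(8).write.mode("overwrite").parquet(dst)
```

Because the write targets a fresh directory, readers of the old location are never exposed to a half-written partition.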

After the compaction is done, update the partition location in the metastore to point to the new location. Finally, clean up the old location sometime in the future, once no queries are using it.
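Assuming the table is managed through the Hive metastore, the location swap can be expressed as a single metadata-only statement; the table and partition names here are hypothetical:

```sql
-- Atomically repoint the partition's metadata at the compacted files.
-- Running queries that already resolved the old location keep reading it;
-- new queries resolve the new location.
ALTER TABLE request_logs PARTITION (ds = '2018-05-03')
SET LOCATION 'hdfs://namenode/data/web/request_logs/v2_ds=2018-05-03';
```

Note that the partition key in the query (`ds`) is unchanged; only the backing filesystem path moves, so users do not need to rewrite anything.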

David Phillips
  • I had thought the ds portion of your partition had to match the name of the partition key in the table, from Hive's perspective? – Mad Dog Aug 30 '18 at 18:33
  • The path can be any arbitrary value. – Dain Sundstrom Aug 30 '18 at 18:38
  • @DainSundstrom I don't think that was the point. Users would say `WHERE ds=x`, then you introduce `v2_ds=y`, and users have to rewrite the queries to get the new data – OneCricketeer Aug 31 '18 at 02:09
  • I don't feel asking a large number of users to re-write their queries is feasible. – Mad Dog Aug 31 '18 at 02:42
  • The location (path) returned from the metastore can be anything. The only reason it contains the table/partition is because that’s what Hive, Presto, etc. use as a default. However, you are free to change it to anything you want. The partition values are stored separately in the metastore. – David Phillips Aug 31 '18 at 06:17
  • @MadDog you can change the base hdfs path (hdfs://namenode/data/web/request_logs_temp/) and keep the partition column same. – Ashish Sep 07 '18 at 06:38