
We have hundreds of HDFS partitions that we write to each hour of the day. The partitions are per-day to make loading into Hive straightforward, and the data is written in Parquet format.

The issue we run into is that, because we want the data to be queryable as fast as possible, the hourly writing produces lots of small files.

There are plenty of examples, such as "How to combine small parquet files to one large parquet file?", covering the combining code itself. My question is: how do you avoid breaking people's active queries while moving the newly compacted files into place and retiring the small ones?

Mad Dog
  • You could use Hive Streaming w/ ORC files which runs its own compaction processor – OneCricketeer Aug 31 '18 at 02:07
  • The data is written from Spark jobs originally. – Mad Dog Aug 31 '18 at 02:45
  • Okay... https://github.com/jerryshao/spark-hive-streaming-sink – OneCricketeer Aug 31 '18 at 02:46
  • In any case, as you've found, HDFS does not like small files, and you cannot offer queries as quick as possible with just HDFS and a Hive table over it... The recommended pattern is generally putting into an actual database such as HBase, then export chunks of it for longer-term storage with Hive – OneCricketeer Aug 31 '18 at 02:49

1 Answer


The metastore has a filesystem location for each partition. This location is often based on the table and partition:

hdfs://namenode/data/web/request_logs/ds=2018-05-03

However, the location can be completely arbitrary, so you can utilize this to implement snapshot isolation or versioning. When you compact the files in the partition, write the new files into a new location:

hdfs://namenode/data/web/request_logs/v2_ds=2018-05-03
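As a sketch of the compaction step itself (the asker mentions the data originates from Spark jobs), the small files can be read and rewritten into the new location in one job. The table paths and the output file count below are illustrative assumptions, not part of the original answer:

```python
# PySpark sketch: read the partition's many small Parquet files and
# rewrite them as a handful of larger files into a *new* directory,
# leaving the original files untouched for in-flight queries.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-partition").getOrCreate()

src = "hdfs://namenode/data/web/request_logs/ds=2018-05-03"
dst = "hdfs://namenode/data/web/request_logs/v2_ds=2018-05-03"

df = spark.read.parquet(src)
# coalesce to a small number of output files sized to suit the HDFS block size
df.coalesce(8).write.mode("overwrite").parquet(dst)
```

Because the write targets a fresh directory, readers of the old location are never exposed to a half-written partition.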

After the compaction is done, update the partition location in the metastore to point to the new location. Finally, clean up the old location sometime in the future, once no queries are using it.
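Assuming the table is managed through the Hive metastore, the location swap can be expressed as a single metadata-only statement; the table and partition names here are hypothetical:

```sql
-- Atomically repoint the partition's metadata at the compacted files.
-- Running queries that already resolved the old location keep reading it;
-- new queries resolve the new location.
ALTER TABLE request_logs PARTITION (ds = '2018-05-03')
SET LOCATION 'hdfs://namenode/data/web/request_logs/v2_ds=2018-05-03';
```

Note that the partition key in the query (`ds`) is unchanged; only the backing filesystem path moves, so users do not need to rewrite anything.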

David Phillips
  • I had thought the ds portion of your partition had to match the name of the partition key in the table, from Hive's perspective? – Mad Dog Aug 30 '18 at 18:33
  • The path can be any arbitrary value. – Dain Sundstrom Aug 30 '18 at 18:38
  • @DainSundstrom I don't think that was the point. Users would say `WHERE ds=x`, then you introduce `v2_ds=y`, and users have to rewrite the queries to get the new data – OneCricketeer Aug 31 '18 at 02:09
  • I don't feel asking a large number of users to re-write their queries is feasible. – Mad Dog Aug 31 '18 at 02:42
  • The location (path) returned from the metastore can be anything. The only reason it contains the table/partition is because that’s what Hive, Presto, etc. use as a default. However, you are free to change it to anything you want. The partition values are stored separately in the metastore. – David Phillips Aug 31 '18 at 06:17
  • @MadDog you can change the base hdfs path (hdfs://namenode/data/web/request_logs_temp/) and keep the partition column same. – Ashish Sep 07 '18 at 06:38