
I was going through the answer given by Roberto in the following post.

What is the difference between partitioning and bucketing a table in Hive?

And it seems like partitioning data by date (my data arrives daily) is not a good idea, as it will end up creating many directories and files in HDFS and decrease overall query performance. Is that right?

What should I do when the business requirement means date will be used most often for querying the data?

Gaurang Shah
  • If date is going to be used more often, then we can create a Hive external table (if you don't need live data) with a date partition, and query the data with the date partition in the WHERE condition (see the sketch after these comments). – Anupam Alok Feb 28 '18 at 10:18
  • "and end up creating many files"... That depends entirely on the volume of data. For example, if you're streaming it from Kafka, then you need a separate process to compact "realtime" data into "batch"-optimized data, which means fewer files of larger sizes – OneCricketeer Feb 28 '18 at 13:19
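To make that first comment concrete, here's a minimal sketch of an external table partitioned by date; the table name, columns, and HDFS path are hypothetical, not from the original post:

```sql
-- Hypothetical external table, partitioned by a string date column.
CREATE EXTERNAL TABLE events (
    user_id STRING,
    action  STRING
)
PARTITIONED BY (event_dt STRING)
STORED AS ORC
LOCATION '/data/warehouse/events';

-- With the partition column in the WHERE clause, Hive prunes down to the
-- matching event_dt=... directory instead of scanning the whole table.
SELECT action, COUNT(*) AS cnt
FROM events
WHERE event_dt = '2018-02-28'
GROUP BY action;
```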

1 Answer


There's absolutely nothing wrong with using the date as a partition. In fact, it's one of the most commonly used partition keys. An extra 365 directories a year won't make any difference to the performance of your cluster.

As for it changing the number of files: if you're ingesting data daily, the number of files won't change whether or not you partition on date. The only difference will be which directories the files are stored in. Given that you'll frequently be querying by date, you absolutely should be partitioning by date.
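To illustrate the "same files, different directories" point, here's a hedged sketch of how a daily batch can land in its own partition, reusing the hypothetical `events` table from the sketch above (the staging table is also hypothetical):

```sql
-- (a) Register a directory that the ingestion job already wrote:
ALTER TABLE events ADD IF NOT EXISTS PARTITION (event_dt = '2018-02-28')
LOCATION '/data/warehouse/events/event_dt=2018-02-28';

-- (b) Or let Hive route rows from a staging table via dynamic partitioning:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE events PARTITION (event_dt)
SELECT user_id, action, event_dt
FROM events_staging;
```

Either way, the same daily files end up on disk; partitioning only decides which `event_dt=...` directory they live in.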

Roberto's point is valid, but he's talking about situations where you have far more partitions than you're considering using. As per a Hortonworks employee:

current Hive versions with RDBMS metastore backend should be able to handle 10 000+ partitions.

So you should partition by date, but add a Jira ticket to your backlog to re-evaluate this in 27 years or so (10,000 partitions ÷ 365 days a year).
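If you ever want to sanity-check how many partitions a table has actually accumulated, Hive can list them directly (again using the hypothetical `events` table):

```sql
-- Lists every partition of the table; the line count is the partition count.
SHOW PARTITIONS events;

-- Narrowing to one value also confirms that a given day's partition exists.
SHOW PARTITIONS events PARTITION (event_dt = '2018-02-28');
```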

Ben Watson
  • As an aside, it's obviously not a direct comparison, but to demonstrate the normal scale of HDFS: the property `dfs.namenode.fs-limits.max-directory-items` - which defines the maximum number of items that can exist inside one HDFS directory - defaults to `1048576` (a quick way to check it from Hive is sketched below). – Ben Watson Feb 28 '18 at 09:09
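Assuming that property is visible to the session's Hadoop configuration, one quick way to inspect it from Hive is `SET` with the property name and no value, which prints the current setting:

```sql
-- In Hive, SET <property>; with no value prints the property's current value
-- as seen by the session's configuration (not necessarily the live NameNode
-- setting, if the cluster overrides it server-side).
SET dfs.namenode.fs-limits.max-directory-items;
```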