3

I have a number of hive warehouses. The data resides in parquet files in Amazon S3. Some of the tables contain TB of data. Currently in hive most tables are partitioned by a combination of month and year, both of which are saved mainly as string. Other fields are either bigint, int, float, double, string and unix timestamps. Our goal is to migrate them to apache iceberg tables. The challenge is how to choose the partition keys.

I have already calculated the cardinality of each field in each table by:

Select COUNT(DISTINCT my_column) As my_column_count
From my_table;

I have also calculated the percentage of null values for each field:

SELECT 100.0 * count(*)/number_of_all_records
FROM my_db.my_table
Where my_column IS NULL; 

In short I already know three things for each field:

  1. Data type
  2. Cardinality
  3. Percentage of null values

By knowing these three pieces of information, my question is how should I choose the best column or combination of columns as partition keys for my future iceberg tables? Are there any rule of thumbs?

How many partitions is considered as optimized when choosing partition keys? What data type is best when choosing partition keys? What are other factors that need to be considered? Is bucketing the same in iceberg tables as it is in hive and how it can be leveraged by the partition keys? What data types are best for partition keys? Is it better to have many small partitions or having a few big partitions? Any other aspects in partition keys that need to be considered?

Olaf Kock
  • 46,930
  • 8
  • 59
  • 90
E. Erfan
  • 1,239
  • 19
  • 37
  • I am not very familiar with Iceberg, but I'd say the rule of thumb is similar to the one for Hive. There are many things to take into consideration, the most important being "how is your table queried most of the time, what type of queries do you want to optimize?". I think you should give more details about what you are trying to do exactly, and what the big picture is. What are the problems you are facing with your current table ? What columns are you considering at the moment ? Why ? (also read this: https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) – FurryMachine Dec 04 '21 at 10:46

1 Answers1

1

One crucial part is missing from your description - the queries. You need to understand what are the queries that will run on this data. Understanding the queries that will run on the data (to the best you can) is super important. For example, consider a simple table with: Date, Id, Name, Age as columns. If the queries are date based meaning, it will query the data in the context of dates,

select * from table where date > 'some-date'

then it's a good idea to partition by date.

However, if the queries are age related

 select * from table where age between 20 and 30

then you should consider partition by age or age groups

roizaig
  • 116
  • 7