I have a number of Hive warehouses. The data resides in Parquet files on Amazon S3, and some of the tables hold terabytes of data. Currently most Hive tables are partitioned by a combination of year and month, both stored mainly as strings. The other fields are bigint, int, float, double, string, or Unix timestamps. Our goal is to migrate these tables to Apache Iceberg, and the challenge is how to choose the partition keys.
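For context, this is roughly the shape of table I expect to end up with (catalog, table, and column names here are made up), where Iceberg's hidden partitioning on a timestamp would replace the explicit year/month string columns:
CREATE TABLE my_catalog.my_db.my_table (
  id       BIGINT,
  event_ts TIMESTAMP,  -- currently kept as a Unix timestamp
  amount   DOUBLE,
  country  STRING
)
USING iceberg
PARTITIONED BY (months(event_ts));  -- hidden partitioning instead of year/month string columns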
I have already calculated the cardinality of each field in each table by:
SELECT COUNT(DISTINCT my_column) AS my_column_count
FROM my_table;
I have also calculated the percentage of null values for each field:
SELECT 100.0 * COUNT(CASE WHEN my_column IS NULL THEN 1 END) / COUNT(*) AS my_column_null_pct
FROM my_db.my_table;
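On the larger tables it may be cheaper to gather both statistics in a single scan per column; a sketch of the combined query, using the same placeholder names as above:
SELECT COUNT(DISTINCT my_column) AS my_column_count,
       100.0 * COUNT(CASE WHEN my_column IS NULL THEN 1 END) / COUNT(*) AS my_column_null_pct
FROM my_db.my_table;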
In short, I already know three things about each field:
- Data type
- Cardinality
- Percentage of null values
Given these three pieces of information, how should I choose the best column or combination of columns as partition keys for my future Iceberg tables? Are there any rules of thumb?
More specifically:
- How many partitions is considered optimal when choosing partition keys?
- Which data types work best for partition keys?
- Is it better to have many small partitions or a few big ones?
- Is bucketing in Iceberg the same as in Hive, and how can it be leveraged through the partition keys? (A sketch of what I mean is below.)
- What other factors should be considered?
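On the bucketing question, my understanding is that Iceberg treats bucketing as just another partition transform rather than a separate CLUSTERED BY clause as in Hive, and that with the Iceberg Spark SQL extensions enabled it can even be added to an existing partition spec. Something like the following, where the table name, column, and bucket count are again made up:
ALTER TABLE my_catalog.my_db.my_table
ADD PARTITION FIELD bucket(16, id);  -- hash id into 16 buckets alongside the existing time partition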