I am using Hive, and the IDE is Hue. I am trying out different column combinations to choose my partition key(s).

The definition of my original table is as follows:

CREATE External Table `my_hive_db`.`my_table`(
    `col_id` bigint,
    `result_section__col2` string,
    `result_section_col3` string ,
    `result_section_col4` string,
    `result_section_col5` string,
    `result_section_col6__label` string,
    `result_section_col7__label_id` bigint ,
    `result_section_text` string ,
    `result_section_unit` string,
    `result_section_col` string ,
    `result_section_title` string,
    `result_section_title_id` bigint,
    `col13` string,
    `timestamp` bigint,
    `date_day` string
    )
    PARTITIONED BY ( 
      `date_year` string, 
      `date_month` string)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3a://some/where/in/amazon/s3'; 

The above table works properly. But when I create a new table with date_day as the partition key, the table is empty and I need to run MSCK REPAIR TABLE. When I do, I get the following error:

Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask

When the partition keys were date_year and date_month, MSCK REPAIR worked properly.

The definition of the table I am getting the error for is as follows:

CREATE External Table `my_hive_db`.`my_table`(
    `col_id` bigint,
    `result_section__col2` string,
    `result_section_col3` string ,
    `result_section_col4` string,
    `result_section_col5` string,
    `result_section_col6__label` string,
    `result_section_col7__label_id` bigint ,
    `result_section_text` string ,
    `result_section_unit` string,
    `result_section_col` string ,
    `result_section_title` string,
    `result_section_title_id` bigint,
    `col13` string,
    `timestamp` bigint,
    `date_year` string, 
    `date_month` string
  )
    PARTITIONED BY (
     `date_day` string)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3a://some/where/in/amazon/s3'; 

After this, the following query returns no results:

SELECT * FROM `my_hive_db`.`my_table` LIMIT 10;

I therefore ran the following command:

MSCK REPAIR TABLE `my_hive_db`.`my_table`;

And I get the error: Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask

I checked this link, as it describes exactly the error I am getting, but when I use the solution provided:

set hive.msck.path.validation=ignore;
MSCK REPAIR TABLE table_name;

I get a different error:

Error while processing statement: Cannot modify hive.msck.path.validation at runtime. It is not in list of params that are allowed to be modified at runtime.

I think the reason I am getting these errors is that there are more than 200 million records with a null value for date_day.

There are 31 distinct non-null values of date_day. I would like to partition my table into 32 partitions: one for each distinct value of date_day, and one for all the null values. Is there a way to do this (i.e., partition by a column that contains null values)?

If this can be achieved with Spark, I am also open to using it.

This is part of a bigger problem of changing partition keys by recreating a table, as mentioned in this link in the answer to my other question.

Thank you for your help.

E. Erfan
  • Cannot modify hive.msck.path.validation at runtime. --> you just need to alter your configuration and restart Hive to fix this issue. Or add that value to the list of params that you can modify at runtime. – Matt Andruff Nov 25 '21 at 16:22
  • It looks like you are trying to repartition an existing table. Partitions are folders under the table location, containing data files. You cannot change the partitioning schema of an existing table; the data must be organized in folders according to the partitioning. Create a new table and load the data from the first table. See https://stackoverflow.com/a/53598283/2700344 and https://stackoverflow.com/a/68760607/2700344 and https://stackoverflow.com/a/68740919/2700344 – leftjoin Nov 25 '21 at 16:25
  • To add to leftjoin's comment: Hive is schema-on-read; it holds definitions for interpreting data. It doesn't alter the data if you change the definition of how to read it, it just tries to read the data as you defined it. – Matt Andruff Nov 25 '21 at 16:27
  • About nulls in partition columns: replace them with something else using NVL() when doing the insert. The partition column does not exist in the data file; it is only metadata plus a folder in S3, which looks like key=value. If the value is null during the insert, Hive will put such records into the __HIVE_DEFAULT_PARTITION__ folder, which creates an additional mess. So replace nulls with something else. – leftjoin Nov 25 '21 at 16:37
  • Matt, I cannot change the configuration, as it is managed by the data lake team. I am using the platform as an external developer. Is there any other way? – E. Erfan Nov 25 '21 at 18:33
  • As leftjoin mentioned, all nulls are being put into Hive's default partition. So why do I still get this error: Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask? What is the source of this error? Does it have nothing to do with null values in the partition key? – E. Erfan Nov 25 '21 at 18:35

1 Answer

You seem to misunderstand how Hive's partitioning works. Hive stores data as files on HDFS (or S3, or some other distributed storage). If you create a non-partitioned Parquet table called my_schema.my_table, you will see files in your distributed storage laid out like this:

hive/warehouse/my_schema.db/my_table/part_00001.parquet
hive/warehouse/my_schema.db/my_table/part_00002.parquet
...

If you create a table partitioned by a column p_col, the files will instead look like this:

hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value1/part_00002.parquet
...
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00002.parquet
...
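
For illustration, a query that filters on the partition column only reads the matching folders (partition pruning). This is a minimal sketch reusing the hypothetical p_col example above:

-- Hive prunes the other partition folders and only reads files
-- under hive/warehouse/my_schema.db/my_table/p_col=value1/
SELECT * FROM my_schema.my_table WHERE p_col = 'value1';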

The MSCK REPAIR TABLE command lets you automatically load the partitions of an external table by scanning its storage location.

Let's say you have folders on s3 that look like this:

hive/warehouse/my_schema.db/my_table/p_col=value1/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value2/part_00001.parquet
hive/warehouse/my_schema.db/my_table/p_col=value3/part_00001.parquet

You create an external table with

CREATE External Table my_schema.my_table(
   ... some columns ...
)
PARTITIONED BY (p_col STRING)

The table will be created but will be empty, because Hive hasn't detected the partitions yet. You then run MSCK REPAIR TABLE my_schema.my_table, and Hive recognizes that your partition column p_col matches the folder layout on S3 (/p_col=value1/) and registers the partitions in the metastore.
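
If MSCK keeps failing and you cannot change hive.msck.path.validation on your platform, one possible workaround is to register partitions one by one with ALTER TABLE ... ADD PARTITION. This is only a sketch with placeholder values and paths:

-- Manually register one partition; the folder must already exist on S3.
ALTER TABLE my_schema.my_table ADD IF NOT EXISTS
  PARTITION (p_col='value1') LOCATION 's3a://your-bucket/path/p_col=value1/';

-- Check which partitions the metastore now knows about:
SHOW PARTITIONS my_schema.my_table;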

From what I could understand from your other question, you are trying to change the partitioning scheme of the table by doing

CREATE External Table my_schema.my_table(
   ... some columns ...
)
PARTITIONED BY (p_another_col STRING)

and you are getting an error message because p_another_col doesn't match the column used in the folder names on S3, which was p_col. This error is perfectly normal, since what you are asking Hive to do doesn't match how the data is laid out on S3.

As stated in the other question's answer, you need to create a copy of the first table, with a different partitioning scheme.

You should instead try something like this:

CREATE External Table my_hive_db.my_table_2(
    `col_id` bigint,
    `result_section__col2` string,
    `result_section_col3` string ,
    `result_section_col4` string,
    `result_section_col5` string,
    `result_section_col6__label` string,
    `result_section_col7__label_id` bigint ,
    `result_section_text` string ,
    `result_section_unit` string,
    `result_section_col` string ,
    `result_section_title` string,
    `result_section_title_id` bigint,
    `col13` string,
    `timestamp` bigint,
    `date_year` string, 
    `date_month` string
)
PARTITIONED BY (`date_day` string)
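
(This is only a sketch: in practice you would also repeat the ROW FORMAT, STORED AS, and LOCATION clauses from your original DDL, pointing the new table at a fresh S3 prefix so it does not collide with the old data.)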

and then populate your new table using a dynamic-partition insert:

INSERT OVERWRITE TABLE my_hive_db.my_table_2 PARTITION(date_day)
SELECT 
  col_id,
  result_section__col2,
  result_section_col3,
  result_section_col4,
  result_section_col5,
  result_section_col6__label,
  result_section_col7__label_id,
  result_section_text,
  result_section_unit,
  result_section_col,
  result_section_title,
  result_section_title_id,
  col13,
  timestamp,
  date_year,
  date_month,
  date_day
FROM my_hive_db.my_table_1
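
Note that a dynamic-partition insert typically requires the following session settings (if your platform locks these down too, you may need to ask the data lake team):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

And, following leftjoin's comment about nulls, you could route the null days into a named partition instead of __HIVE_DEFAULT_PARTITION__ by replacing the last column of the SELECT with something like NVL(date_day, 'unknown') AS date_day (the 'unknown' label is an arbitrary placeholder).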
FurryMachine
  • Thank you FurryMachine. I tried your solution the other day by creating a new table with the new partition keys and inserting data into it by selecting from table1 with the old partition keys. However, the table contains terabytes of data, and it paralyzed the production warehouse for more than 24 hours, causing other jobs to fail. When the data is this big, is there any other solution that is more performant? – E. Erfan Dec 03 '21 at 08:56
  • Well, first I would suggest you discuss this with a senior engineer from your company before trying anything dangerous in prod. Surely you have their attention now. Then, a few suggestions: if your input table is partitioned by month and you want to make a table partitioned by day, you could run one query for each month to make this more digestible for your infrastructure. But most of all, try to POC this and evaluate the impact before going all out. – FurryMachine Dec 03 '21 at 13:58
  • Thanks FurryMachine. Do you also have experience with Iceberg partition keys? I am trying to understand how they differ from Hive's: https://stackoverflow.com/questions/70211642/how-to-choose-partition-keys-for-apache-iceberg-tables – E. Erfan Dec 03 '21 at 14:28
  • No, I am not very familiar with Iceberg, but I'd say you are probably looking in the right direction, as it was designed to solve some of the partitioning problems you seem to be having. You have probably already read this: https://iceberg.apache.org/#partitioning and this: https://developer.ibm.com/articles/the-why-and-how-of-partitioning-in-apache-iceberg/ – FurryMachine Dec 04 '21 at 10:33