0

I have a table in Hive which I would like to drop its partition keys for later using other partition keys.

The location of parquet file is in Amazon S3. The table, I am working on, is partitioned by date_year and date_month columns. There are 143 partitions altogether. Now I am trying to drop the partition by executing the following command:

Alter Table `my_hive_db`.`my_table`
Drop PARTITION (`date_year` , `date_month` );

I however, get this error:

Error while compiling statement: FAILED: ParseException line 48:28 mismatched input ',' expecting set null in drop partition statement.

enter image description here

If it helps, my table definition is as follows:

CREATE External Table `my_hive_db`.`my_table`(
    `col_id` bigint,
    `result_section__col2` string,
    `result_section_col3` string ,
    `result_section_col4` string,
    `result_section_col5` string,
    `result_section_col6__label` string,
    `result_section_col7__label_id` bigint ,
    `result_section_text` string ,
    `result_section_unit` string,
    `result_section_col` string ,
    `result_section_title` string,
    `result_section_title_id` bigint,
    `col13` string,
    `timestamp` bigint,
    `date_day` string
    )
    PARTITIONED BY ( 
      `date_year` string, 
      `date_month` string)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3a://some/where/in/amazon/s3';

The most important thing is, I really don't want to delete underlying files. I only want to drop the partition key so that later I can repartition the table with different combination of columns. The question is how can I alter the table, droping the partitions, but still keeping those partition keys in the table as normal columns.

I am open to achieve this by either Hive or Spark. However, Hive is more preferred at this stage.

Thank you for your valuable input.

E. Erfan
  • 1,239
  • 19
  • 37

1 Answers1

0

I do not think you can re-partition hive table based on different column(s). Because partitions are mapped to physical folders in HDFS and it can not be redistributed as needed.

So, only option is to -

  1. backup the table into a bkp table.
  2. drop original table and recreate the table with new partition.
  3. insert into new original table from backup.

Alternately you can create a new table with new partitions and insert from old table and then drop old and rename new table.

Koushik Roy
  • 6,868
  • 2
  • 12
  • 33
  • are you sure there is no dynamic way? I am basically looking for a way to write these two commands after each other: Alter Table my_table Drop Partition (date_year, date_month); Alter Table my_table Add Partition (new_col); But as I said earlier both are failing me. – E. Erfan Nov 25 '21 at 16:13
  • Following the process you suggested, I created a new table with the new partition key, and I tried to MSCK Repair Table my_table_with_new_PK; to get data in it. However, I got a series of new errors. For some reason, MSCK gives me the error Error while compiling statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.ddl.DDLTask. I am not sure what the reason is, but my initial guess is because of the null values in the new partition key column. – E. Erfan Nov 25 '21 at 16:17
  • This leads me to a different question, how to partition a table by a column with null values. I put this issue into a new link as how to partition a table by a column with null values. Can you please also have a look at it as well: https://stackoverflow.com/questions/70114230/hive-how-to-partition-by-a-colum-with-null-values-putting-all-nulls-in-one-par – E. Erfan Nov 25 '21 at 16:18
  • The most important thing is if there is a simple more smart way rather than recreating tables for repartitioning. – E. Erfan Nov 25 '21 at 16:19
  • Partition with null value will go to a default partition. – Koushik Roy Nov 25 '21 at 17:34
  • 1
    Wish there is a smarter way... but hive create folders for each partition which means os has to recreate folder structure again and distribute the data into them. So its something youre doing manually. – Koushik Roy Nov 25 '21 at 17:39