
I need to delete specific rows from a partitioned Hive table. The rows to delete match certain conditions, so entire partitions cannot simply be dropped. Let's say the table `Table` has three columns: `partner`, `date` and `source_key`, and it is partitioned by `date` and `source_key`.

It is known that Hive supports no operation for deleting or updating a particular set of records (see How to delete and update a record in Hive).

Following this solution, I successfully executed the queries below in order to keep only the records matching some given conditions, say: belonging to a given range of `date`, having `source_key = 'heaven'` and `partner <> 'angel'`...

Create a temporary empty copy of the table `Table`:

CREATE TABLE IF NOT EXISTS tmpTable LIKE Table;

Fill it with the current rows.

INSERT OVERWRITE TABLE tmpTable
PARTITION (date,source_key)
SELECT * FROM Table
WHERE
date >= '2020-05-01' AND date < '2020-11-30' AND
source_key = 'heaven';

Delete target partitions.

ALTER TABLE Table DROP IF EXISTS
PARTITION (source_key = 'heaven' , date >= '2020-05-01' , date < '2020-11-30' );

Insert the edited partitions into the target table (I couldn't manage to use INSERT OVERWRITE because of syntax errors):

INSERT INTO Table
PARTITION (source_key,date)
SELECT * FROM tmpTable
WHERE
partner <> 'angel';

Drop temporary table.

DROP TABLE IF EXISTS tmpTable;

The queries run fine. Because the table `Table` is managed, the HDFS files should be deleted along with the dropped partitions, but something is wrong (perhaps in the last `INSERT INTO` statement), because after executing all of these queries the target table `Table` still keeps all records with `partner = 'angel'` in the given range of dates and with `source_key = 'heaven'`; basically, it stays the same.

Where is the fault? What is missing? How can I accurately delete specific rows matching certain conditions in such a Hive table?

Dr Potato
  • Why are the partition specs different: `PARTITION (date,source_key)` and `PARTITION (source_key,date)`? You do not need a temporary table. Table partitions can be overwritten directly from a select from the table itself. – leftjoin Apr 30 '21 at 07:19
  • According to https://stackoverflow.com/questions/33230686/what-will-happen-if-a-hive0-13-select-and-insert-overwrite-are-running-at-the?rq=1 it is not possible to select and insert into the same table. I have not tried it myself, though. – Dr Potato May 01 '21 at 01:00
  • This is not how it works in Hive. First the select is executed and its result is stored in a temporary directory; if there are no exceptions, the result is then moved into the target table directory (the select and the write are not concurrent), provided, of course, the table is not in ACID mode. In ACID mode OVERWRITE is not possible. INSERT OVERWRITE into a non-ACID table from itself works as I described. That question is about different sessions. Just try it and you will see how it works. – leftjoin May 01 '21 at 05:41

1 Answer


Table partitions can be overwritten directly from a select from the table itself plus a WHERE filter. The scenario is quite simple and you do not need any temporary table. Make a backup table if you are not sure what will happen.

  1. If you want to drop entire partitions (not overwrite them), execute:

    ALTER TABLE TableName DROP IF EXISTS
    PARTITION (<partition spec to be dropped>); --check partition spec to be dropped carefully
    

Skip this step if there are no partitions to be dropped.
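As a concrete sketch for the question's table (the bounds are the ones from the question and should be double-checked against your actual partitions; note that comparison operators such as `>=` and `<` are allowed in a drop partition spec, while functions are not):

    ALTER TABLE Table DROP IF EXISTS
    PARTITION (source_key = 'heaven', date >= '2020-05-01', date < '2020-11-30');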

  2. Overwrite the other partitions with the filtered rows:

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    set hive.allow.move.on.s3=true; --If you are on Qubole/S3
    
    insert overwrite table TableName partition (date, source_key ) --partition spec should match table DDL
    select * from TableName 
     where <condition>; --condition should be TRUE for the rows which should NOT be deleted
    
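Applied to the question's scenario (a sketch, assuming the table is partitioned by `date` and `source_key` as described and the rows to delete are those with `partner = 'angel'` in the given range), the whole temporary-table sequence collapses into:

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;

    insert overwrite table Table partition (date, source_key)
    select * from Table
    where date >= '2020-05-01' and date < '2020-11-30'
      and source_key = 'heaven'
      and partner <> 'angel';  --keep everything except the 'angel' rows

Only the partitions present in the result set are overwritten. One caveat: a partition whose rows all have `partner = 'angel'` returns no rows from the select and therefore is not overwritten; such partitions would have to be dropped explicitly in step 1.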

Your code is rather confusing because you created the temporary table using LIKE but then used a different partition specification while selecting `*` (which preserves the column order of the original table). The column order should match exactly, with the partition columns last and in the same order.
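A minimal sketch of a select list that matches the question's DDL (non-partition columns first, then the partition columns in the order of the partition spec):

    insert overwrite table Table partition (date, source_key)
    select partner, date, source_key  --explicit order instead of *
    from Table
    where <condition>;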

leftjoin
  • I will try this. However, if the table is huge and only a few records are to be dropped, this would be very expensive (because it overwrites the whole table) and moreover the `` statement would be complicated. I guess I can set the partitions to be overwritten in the `partition(...)` clause, right? Then I would only need to state the `` for those partitions instead of overwriting the entire table, like `parititon(source_key,date>min_date, date – Dr Potato May 03 '21 at 03:29
  • Insert overwrite ... select will overwrite only those partitions present in the dataset returned by the select; all other partitions will remain as they were before the overwrite. Just write the correct WHERE and check what the select returns. – leftjoin May 03 '21 at 08:30
  • You can use a static partition overwrite like this: `partition(source_key='some_key', date='some_date')`. In a drop partition spec you can use `>`, `<`, `=`; functions are not allowed in the partition spec. – leftjoin May 03 '21 at 08:32
  • Ok, as https://stackoverflow.com/a/55758889/9106179 reads: `Insert overwrite table partition select ,,,` overwrites only partitions existing in the dataset returned by the select. I'm going to try this out and let you know how it went... – Dr Potato May 03 '21 at 20:08
  • Indeed this query works, but in some cases it fails with a memory error: `info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space`, therefore the query should be improved. Any suggestions? – Dr Potato May 07 '21 at 21:33
  • @DrPotato The OOM can be fixed, but I need all the details. On which vertex does it fail? (Or are you running it on MR?) Please provide as many details as possible: the query plan and the failed container logs taken from the jobtracker. It will help to understand what is going wrong: a memory configuration issue, parallelism, or writing too many partitions on the reducer. Please ask a new question with the details. – leftjoin May 08 '21 at 08:27
  • Oh, that's kind! At the least I would include a `DISTRIBUTE BY date, source_key` clause after the `SELECT ... WHERE` statement, and if that is still too heavy for the cluster then set some other Hive variables, but that depends a lot on the use case :) – Dr Potato May 13 '21 at 23:25