14

I'm doing some automatic script of few queries in hive and we found that we need time to time clear the data from a table and insert the new one. And we are thinking what could be faster?

INSERT OVERWRITE TABLE SOME_TABLE
    SELECT * FROM OTHER_TABLE;

or is faster to do like this:

DROP TABLE SOME_TABLE;
CREATE TABLE SOME_TABLE (STUFFS);
INSERT INTO TABLE
    SELECT * FROM OTHER_TABLE;

The overhead of running the queries is not an issue. Due to we have the script o creation too. The question is, the INSERT OVERWRITE with billion of rows is faster than DROP + CREATE + INSERT INTO?

leftjoin
  • 36,950
  • 8
  • 57
  • 116
Thiago Baldim
  • 7,362
  • 3
  • 29
  • 51
  • 1
    `Insert OVERWRITE` make sure that previous data is been removed from HDFS and only new data available in table/partition. `DROP + CREATE + INSERT INTO` might not guaranty the same. – Ambrish Sep 21 '16 at 18:00
  • 1
    @Ambrish why is that/can you provide a source? I would assume drop table removes the table and all old data entirely. – Brendan Sep 13 '19 at 20:05

2 Answers2

15

For maximum speed I would suggest to 1) issue hadoop fs -rm -r -skipTrash table_dir/* first to remove old data fast without putting files into trash because INSERT OVERWRITE will put all files into Trash and for very big table this will take a lot of time. Then 2) do INSERT OVERWRITE command. This will be faster also because you do not need to drop/create table.

UPDATE:

As of Hive 2.3.0 (HIVE-15880), if the table has TBLPROPERTIES ("auto.purge"="true") the previous data of the table is not moved to Trash when INSERT OVERWRITE query is run against the table. This functionality is applicable only for managed tables. So, INSERT OVERWRITE with auto purge will work faster than rm -skipTrash + INSERT OVERWRITE or DROP+CREATE+INSERT because it will be a single Hive-only command.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • is this applicable for partitions as well? If auto.purge is set to true, when insert overwrite to partition, that partitions old data will also not be moved to trash? – uptoNoGood Jan 25 '22 at 12:59
  • 1
    @uptoNoGood yes, it should delete old data in partitions being overwritten without moving to thrash – leftjoin Jan 25 '22 at 16:53
6

One edge consideration is that if your schema changes, INSERT OVERWRITE will fail, while DROP+CREATE+INSERT will not. While this is unlikely to apply in most scenarios, if you're prototyping workflow/table schemas then it might be worth considering.

Brendan
  • 1,905
  • 2
  • 19
  • 25