I am trying to find the best way to implement the following pipeline in Hive and HDFS:

I would like to ingest a CSV file (no problem there so far), but I want the partitions to be created from a field that comes inside the CSV files.

I created an external table that points to the files on HDFS and defined the partition field, but with a simple put (which makes sense) the partitions are not created, and I get an exit code 1 when running MSCK REPAIR TABLE.
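Roughly what I have, with simplified made-up names (my real DDL differs):

CREATE EXTERNAL TABLE my_table (
    col1 STRING,
    col2 STRING
)
PARTITIONED BY (partition_col STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/my_table';

-- the csv is uploaded with a plain hdfs put into /data/my_table/,
-- which does not create any partition_col=value directories,
-- and this is then the statement that exits with code 1:
MSCK REPAIR TABLE my_table;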

I would like to know if the following is possible or viable to do:

  1. Load the CSV file into an internal Hive table, used as a temporary staging table.

  2. Do the insert into the "official" partitioned table from that staging table, as sketched below.
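Something like this is what I have in mind, again with made-up names:

-- 1) temporary staging table matching the raw csv layout, no partitions:
CREATE TABLE staging_csv (
    col1 STRING,
    col2 STRING,
    partition_col STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA INPATH '/landing/yourfile.csv' INTO TABLE staging_csv;

-- 2) dynamic-partition insert into the partitioned "official" table:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE official_table PARTITION (partition_col)
SELECT col1, col2, partition_col FROM staging_csv;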

Is this an efficient way to do it? If so, I haven't found much information about how to do the first step.

Thanks.

Alejandro A
  • Load the CSV file into an "official" table partition – leftjoin Mar 19 '19 at 18:03
  • Hi @leftjoin, thank you, I am struggling with how to do this. I guess you mean using a LOAD DATA statement? – Alejandro A Mar 20 '19 at 07:59
  • Yes. Do your files contain a single partition per file, or many partitions in a single file? – leftjoin Mar 20 '19 at 08:34
  • @leftjoin There should be only one partition per file, but it could be that sometimes there is more than one. Thanks for pointing me in the right direction; I still have to plan how I would do everything: load the file into a temp directory and then load the data into the "official" table, I guess – Alejandro A Mar 20 '19 at 08:58
  • If there is only one partition per file then you could load directly into the target table partition. In the worst scenario, if there are many partitions per file, you need to load into an intermediate table first and then use insert overwrite ... select from the intermediate table. And you need to load (overwrite) the entire partition data at once; if that is not possible, then the select should merge the data already existing in the target table with the incremental load. – leftjoin Mar 20 '19 at 09:27
  • @leftjoin thank you very much, feel free to post your message as an answer so I can accept it, it really helped – Alejandro A Mar 22 '19 at 09:21
  • Added the answer – leftjoin Mar 22 '19 at 11:19

1 Answer


If there is only one partition per file, then you can load directly into the target table partition using

hive> LOAD DATA LOCAL INPATH 'yourpath/yourfile.csv' OVERWRITE INTO TABLE your_table PARTITION (key='value');

In the worst scenario, when there are many partitions per file, you need to load into an intermediate table first and then re-partition the data using insert-overwrite-select.
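The intermediate table here is just a non-partitioned table whose columns match the CSV layout (intermediate_table is an illustrative name), so loading it is a plain LOAD DATA:

LOAD DATA LOCAL INPATH 'yourpath/yourfile.csv' OVERWRITE INTO TABLE intermediate_table;

Then re-partition the data with a dynamic-partition insert: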

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table official_table partition(partition_col)
select col1, col2, partition_col --the partition column must be the last one
  from intermediate_table;

Note that insert overwrite replaces the whole partition, so you need to load (overwrite) all of a partition's data at once; if that is not possible, then the select should merge the data already existing in the target partition with the incremental load.

See this answer about how to merge incremental updates.
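A rough sketch of that kind of merge, assuming a unique key column id and a load timestamp load_ts for picking the newest version of each row (both illustrative names; the dynamic partition settings from above still apply):

insert overwrite table official_table partition(partition_col)
select col1, col2, partition_col
from
(
 select id, col1, col2, partition_col,
        row_number() over (partition by id order by load_ts desc) rn
   from (select id, col1, col2, load_ts, partition_col from official_table
         union all
         select id, col1, col2, load_ts, partition_col from intermediate_table
        ) u
) t
where rn = 1;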

Also, INSERT INTO table PARTITION (...) SELECT ... (without OVERWRITE) is a possible solution for inserting incremental data. This statement loads data into additional files in the partition folder; existing data remains as is.
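For example, with the same illustrative names as above:

insert into table official_table partition(partition_col)
select col1, col2, partition_col
  from intermediate_table;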

leftjoin