I have an external table in Hive that reads all the files from the HDFS location (/user/hive/warehouse/tableX).

Now, let's assume the data is pre-partitioned and all the previous files are split across several directories with a specific naming convention, <dir_name>_<incNumber>, e.g.

/user/hive/warehouse/split/
  ./dir_1/files...
  ./dir_2/files...
  ./dir_n/files...

How can I create another external table that keeps track of all the files in the split folder?

Do I need to create an external table that is partitioned on each sub-folder (dir_x)?

Also, is some kind of Hive or shell script needed to create/add a partition for each sub-directory?

Alg_D

2 Answers

You have to create an external table partitioned by dir_x to access all files in multiple folders.

CREATE EXTERNAL TABLE sample_table (col1 string,
                                    col2 string,
                                    col3 string,
                                    col4 string)
PARTITIONED BY (dir string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/split';

Then add each partition as you would for a regular partitioned table:

ALTER TABLE sample_table ADD PARTITION(dir='dir_1')
LOCATION '/user/hive/warehouse/split/dir_1';
ALTER TABLE sample_table ADD PARTITION(dir='dir_2')
LOCATION '/user/hive/warehouse/split/dir_2';
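
Once the partitions are registered, queries can prune on the dir column; a quick sanity check (using the placeholder column names from the DDL above) would be:

SELECT col1, col2
FROM sample_table
WHERE dir = 'dir_1';  -- Hive reads only /user/hive/warehouse/split/dir_1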

This approach will work, but there is a caveat: if at some point in the future you add a new folder (e.g. dir_100) under the warehouse path, Hive will not pick it up automatically, and you will have to register it with another ALTER TABLE ... ADD PARTITION statement. I haven't worked with Hive for about 10 months now, so I am not sure if there is a better approach. If this is not an issue, you can use this approach.
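
If you do end up scripting the partition registration, here is a minimal sketch, assuming the table and location from this answer, the dir_<incNumber> naming convention from the question, and that the hdfs and hive command-line tools are available:

#!/bin/bash
# Sketch only: register every dir_* sub-folder under the table location
# as a partition. IF NOT EXISTS makes it safe to re-run (e.g. from cron)
# whenever new directories appear.
BASE='/user/hive/warehouse/split'
hdfs dfs -ls "$BASE" | awk '{print $NF}' | grep '/dir_[0-9]*$' |
while read -r path; do
  dir=$(basename "$path")
  echo "ALTER TABLE sample_table ADD IF NOT EXISTS PARTITION (dir='${dir}') LOCATION '${path}';"
done | hive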

dheee
  • Thanks for the reply. Indeed, this is a solution; the problem is that a new directory is added on a regular basis (e.g. every day), so something like a script that reads all the directories in ./split is needed to re-create the table. I'm not sure if this is very efficient or the best solution. – Alg_D Jun 01 '16 at 23:17
  • @Algina you can write an Oozie script to do that. The Oozie script will be able to drop and recreate the table, and also add all the partitions back whenever there is a new entry. – dheee Jun 01 '16 at 23:26

Once you have your external table in place and you know the partition structure that Hive expects, you can load the data files (into pre-partitioned folders) through some process. Once the pre-partitioned files are available (matching the partition structure Hive knows about), you can repair the table as below:

hive> MSCK REPAIR TABLE sample_table; 

The above statement will automatically sync all of your existing partitions into the Hive metastore for the existing EXTERNAL TABLE "sample_table".
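
For example, once a new folder lands under /user/hive/warehouse/split (dir_3 below is only an illustrative name, not one from the question), a single repair registers it:

hive> MSCK REPAIR TABLE sample_table;
hive> SHOW PARTITIONS sample_table;  -- dir=dir_3 should now appear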

Srisatya