1

I am creating an external table that refers to ORC files in an HDFS location. That ORC files are stored in such a way that the external table is partitioned by date (Mapping to date wise folders on HDFS, as partitions). However, I am wondering if I can enforce 'Bucketing' on these external tables because the underlying data/files are not 'managed' by hive. They are written externally and hence can bucketing be used in Hive External Tables?

Hive is allowing me to use the 'CLUSTERED BY' clause while creating an external table. But I am not able to understand how hive will redistribute the data into buckets, what is already written on HDFS as ORC files?

I have seen similar questions on PARTITION AND BUCKETING in External tables here:

Hive: Does hive support partitioning and bucketing while usiing external tables

and

Can I cluster by/bucket a table created via "CREATE TABLE AS SELECT....." in Hive?

but the answers are talking only about Partition support in external tables or bucket support in MANAGED tables. I am aware of both those options and am already using it but need specific answers about bucketing support in Hive EXTERNAL tables.

So, In summary, Do Hive External Tables support bucketing? If yes, how is the data in the external folder redistributed into buckets by hive?

1 Answers1

2

Yes, Hive does support bucketing and partitioning for external tables.

Just try it:

SET hive.tez.bucket.pruning=true;
SET hive.optimize.sort.dynamic.partition=true;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.enforce.bucketing = true;

drop table stg.test_v1;
create external table stg.test_v1 
(
       id bigint
      ,name string
)
partitioned by (created_date string)
CLUSTERED BY(name) INTO 3 BUCKETS
stored as ORC
;

INSERT OVERWRITE TABLE stg.test_v1 partition(created_date)
SELECT 
      id, name, created_date
FROM
(select stack(3, 
             1, 'Valeriy', '2020-01-01',
             2, 'Victor', '2020-01-01',
             3, 'Ankit', '2020-01-01'
             ) as(id, name, created_date)
)s;

DESC FORMATTED says:

Table Type:          EXTERNAL_TABLE 
...
Num Buckets:         3                     
Bucket Columns:      [name] 

Load more rows and you will see, it will create 3 files per partition

See also this documentation for more details about features supported for Managed and External tables: Managed vs External Tables.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • Thanks leftjoin, for this answer. I had just tried the same thing right now and realised it is possible. came to answer my own question and saw this :) I also realised that we have to use "INSERT OVERWRITE" to take advantage of the bucketing feature. If I create an external Table over an existing data, the table is not even recognising the data!! Only if i do an insert overwrite, the data is copied into the target folder and is read by the bucketed table. – Sai Geetha M N Jul 31 '20 at 10:37
  • 2
    `msck repair table` should also work in case of creating table on existing data. – wypul Nov 14 '20 at 07:49