Can I cluster by/bucket a table created via "CREATE TABLE AS SELECT....." in Hive?

Question

I am trying to create a table in Hive

CREATE TABLE BUCKET_TABLE AS 
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll
CLUSTERED BY (key) INTO 1000 BUCKETS;

This syntax is failing - but I am not sure if it is even possible to do this combined statement. Any ideas?

try CREATE TABLE BUCKET_TABLE CLUSTERED BY (key) INTO 1000 BUCKETS AS SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll ; — Patrick McCann, Apr 24 '15 at 16:33
@patrick it throws error FAILED: SemanticException [Error 10068]: CREATE-TABLE-AS-SELECT does not support partitioning in the target table ..did you try this? — Rahul Sharma, Oct 14 '16 at 21:26

score 17 · Accepted Answer · edited Dec 08 '17 at 04:42

Came across this question and saw there was no answer provided. I looked further and found the answer in the Hive documentation.

This will never work, because of the following restrictions on CTAS:

The target table cannot be a partitioned table.
The target table cannot be an external table.
The target table cannot be a list bucketing table.

Source: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect%28CTAS

Furthermore https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
...
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
...
[AS select_statement];

Clustering requires the column to be defined and then the cfg goes to the As select_statement Therefore at this time it is not possible.

Optionally, you can ALTER the table and add buckets, but this does not change existing data.

CREATE TABLE BUCKET_TABLE 
STORED AS ORC AS 
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll limit 0;
ALTER TABLE BUCKET_TABLE CLUSTERED BY (key) INTO 1000 BUCKETS;
ALTER TABLE BUCKET_TABLE SET TBLPROPERTIES ('transactional'='true');
INSERT INTO BUCKET_TABLE 
SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll;

Too bad that `Hive` doesn't support it because `Cloudera` `Impala` [does it](https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_create_table.html) — y2k-shubham, Apr 26 '19 at 20:14

serg · Answer 2 · 2016-03-25T08:29:36.743

Looks like it is impossible. When trying to submit such query to ambari:

CREATE TABLE ready_requests 
CLUSTERED BY (device) INTO 64 BUCKETS
as
SELECT ...;

I got something like: "CREATE-TABLE-AS-SELECT does not support partitioning in the target table" Even it is not correct message, looks like the bucketing is also not supported

There is documentation that partitions are not supported, please see link, for buckets there is not such information, but looks like we have the same problem

score 0 · Answer 3 · answered Apr 24 '19 at 11:49

CREATE TABLE BUCKET_TABLE AS SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll;
Alter table bucket_table clustered by(key) into 64 buckets;
Insert overwrite table bucket_table SELECT a.* FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) WHERE b.key IS NUll;

Hope it helps you. No need to add transaction properties to true (present in the first comment) as TRANSACTION Properties need to be set to true when we need to enable ACID property and for that bucketing and orc format needs to be present.

Sambit Tripathy · Answer 4 · 2014-07-22T21:40:09.130

-1

You have to use a non-null column in your clustered by clause.

edited Jul 22 '14 at 21:40

answered Jul 22 '14 at 20:50

Sambit Tripathy

434
1
4
14

That gives me the same syntax error (FAILED: ParseException line 1:235 missing EOF at 'clustered' near 'NULL' (state=42000,code=40000)) – Andrew Jul 22 '14 at 21:17
@Andrew this does not work for you as you are selecting rows from table1 where the keys are null and then using clusterd by on those null keys. Clustered by should run on a non-null column. – Sambit Tripathy Jul 22 '14 at 21:37
No, table1 keys aren't null in the query. table2 keys are. – Andrew Jul 22 '14 at 21:50
But it looks like you are trying to pull all records with key as null. – Sambit Tripathy Jul 22 '14 at 22:02
1. Please try to change NUll to NULL 2. Try this query FROM TABLE1 a LEFT JOIN TABLE2 b ON (a.key=b.key) INSERT OVERWRITE TABLE BUCKET_TABLE SELECT a.* WHERE b.key IS NULL CLUSTERED BY (key) INTO 1000 BUCKETS; – Sambit Tripathy Jul 22 '14 at 22:21

Can I cluster by/bucket a table created via "CREATE TABLE AS SELECT....." in Hive?

4 Answers4

Linked