2

I'm a newbie on hive, so a basic question: How do I create a query such that the result of that query is partitioned in a specific way?

For example:

CREATE TABLE IF NOT EXISTS tbl_x (
 x SMALLINT,
 y FLOAT)
PARTITIONED BY (id SMALLINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;

INSERT INTO TABLE `tbl_x` 
VALUES (1, 1, 1.0),
       (1, 1, 2.0),
       (1, 2, 3.0),
       (1, 2, 4.0),
       (2, 1, 5.0),
       (2, 1, 6.0),
       (2, 2, 7.0),
       (2, 2, 8.0);

CREATE TABLE tbl_y AS SELECT `id`, `x`, SUM(`y`) AS `y_sum`
FROM `tbl_x`
GROUP BY `id`, `x`;

In that example, I'd like tbl_y to be partitioned too.

Trying this doesn't work:

CREATE TABLE tbl_y AS SELECT `id`, `x`, SUM(`y`) AS `y_sum`
FROM `tbl_x`
GROUP BY `id`, `x` PARTITIONED BY (id SMALLINT);  

What is the trick here? Should I define the partitioned table first and insert the results in?

leftjoin
  • 36,950
  • 8
  • 57
  • 116
Leo Barlach
  • 480
  • 3
  • 13
  • Possible duplicate of [CTAS with Dynamic Partition](https://stackoverflow.com/questions/43241648/ctas-with-dynamic-partition) – Pushkr Apr 05 '19 at 01:05

1 Answers1

2

Yes, you should create a partitioned table separately. Create partitioned table as select (CTAS) is not supported.

CREATE TABLE tbl_y(x smallint,y_sum double)
partitioned by (id smallint)
STORED AS ORC;

If the table schema is the same, you can use CREATE LIKE:

CREATE TABLE tbl_y like tbl_x;

You also can use DISTRIBUTE BY to distribute the data evenly between reducers, see this answer also: https://stackoverflow.com/a/38475807/2700344

insert overwrite table tbl_y partition(id)
select id, x, SUM(y) AS y_sum
fromtbl_x
group by id, x 
distribute by id, FLOOR(RAND()*100.0)%20;
leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • Thanks. I think there's a typo there on the DISTRIBUTE bit, since you added it twice. I also got this error: `Completed executing command(queryId=hive_20190405103559_37aabd69-88dc-4137-8289-aa35357e1385); Time taken: 25.555 seconds Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.StatsTask (state=08S01,code=1)` – Leo Barlach Apr 05 '19 at 15:37
  • @LeoBarlach This is not informative. Dig Job tracker logs of failed container. – leftjoin Apr 05 '19 at 15:43