I am trying to follow the post here to set a variable in my Hive query. Assuming I've the following file in hdfs:
/home/hduser/test/hr.txt
Berg,12000
Faviet,9000
Chen,8200
Urman,7800
Sciarra,7700
Popp,6900
Paino,8790
I then created my schema on top of the data as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS employees (lname STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/home/hduser/test/';
I want to create 4 tiles for the table but I don't want to hardcode the number of tiles and instead want to pass it in as a variable. My code is below:
SET q1=select ceiling(count(*)/2) from employees;
SELECT lname,
salary,
NTILE(${hiveconf:q1}) OVER (
ORDER BY salary DESC) AS quartile
FROM employees;
However, this throws an error:
FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: Number of tiles must be an int expression
I tried to use quotes when calling the variable, as in '${hiveconf:q1}'
, but that didn't seem to help. If I hardcode the number of tiles (which I am trying to avoid), the workflow will go something like this:
SELECT lname,
salary,
NTILE(4) OVER (
ORDER BY salary DESC) AS quartile
FROM employees;
which yields
Berg 12000 1
Faviet 9000 1
Paino 8790 2
Chen 8200 2
Urman 7800 3
Sciarra 7700 3
Popp 6900 4
Thoughts?