0

I am trying to follow the post here to set a variable in my Hive query. Assuming I've the following file in hdfs:

/home/hduser/test/hr.txt

Berg,12000
Faviet,9000
Chen,8200
Urman,7800
Sciarra,7700
Popp,6900
Paino,8790

I then created my schema on top of the data as follows:

CREATE EXTERNAL TABLE IF NOT EXISTS employees (lname STRING, salary INT) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/home/hduser/test/';

I want to create 4 tiles for the table but I don't want to hardcode the number of tiles and instead want to pass it in as a variable. My code is below:

SET q1=select ceiling(count(*)/2) from employees;
SELECT lname,
       salary,
       NTILE(${hiveconf:q1}) OVER (
                      ORDER BY salary DESC) AS quartile
FROM employees;

However, this throws an error:

FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: Number of tiles must be an int expression

I tried to use quotes when calling the variable, as in '${hiveconf:q1}', but that didn't seem to help. If I hardcode the number of tiles (which I am trying to avoid), the workflow will go something like this:

SELECT lname,
       salary,
       NTILE(4) OVER (
                      ORDER BY salary DESC) AS quartile
FROM employees;

which yields

Berg    12000   1
Faviet  9000    1
Paino   8790    2
Chen    8200    2
Urman   7800    3
Sciarra 7700    3
Popp    6900    4

Thoughts?

Community
  • 1
  • 1
sedeh
  • 7,083
  • 6
  • 48
  • 65
  • Try using single quotes around the `select...` clause when _setting_ the variable. I googled around a lot and the Hive variable capability is pretty limited -- basically macros. I would consider rethinking how you could accomplish your objective using SQL. – Tom Harrison Apr 17 '16 at 19:28

1 Answers1

0

When there isn't a documented way one can use documented features to provide a clean enough hack :)

Here's my attempt, using dfs commands from hive, shell commands from hive, the source-command and what not. I guess it might not work out of the box with queries through Hiveserver2. I would be glad if there were a prettier way

Let's go

Basic setup

SET EMPLOYEE_TABLE_LOCATION=/home/hduser/test/;

CREATE EXTERNAL TABLE IF NOT EXISTS employees (lname STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '${hiveconf:EMPLOYEE_TABLE_LOCATION}';

SET PATH_TO_SETTINGS_FILE=hdfs:/tmp/query_to_setting;
SET FILENAME_ON_LOCAL_FS=query_to_setting.sql;

Generate a file in hdfs

with content "SET q1=<the-query-result>;"

CREATE TABLE query_to_setting_table
LOCATION '${hiveconf:PATH_TO_SETTINGS_FILE}'
AS
SELECT concat('SET q1=', ceiling(count(*)/2),'\;') from employees;

Source in the generated file as any sql-file.

First put the file to local fs since 'source' only operates on local disk...

dfs -get ${hiveconf:PATH_TO_SETTINGS_FILE}/000000_0 ${hiveconf:FILENAME_ON_LOCAL_FS};
source ${hiveconf:FILENAME_ON_LOCAL_FS};

Try the setting

hive> SET q1;
q1=4

Use the setting in a query

hive > SELECT lname,
          salary,
          NTILE( ${hiveconf:q1}) OVER (
                      ORDER BY salary DESC) AS quartile
       FROM employees;
OK
Berg    12000   1
Faviet  9000    1
Paino   8790    2
Chen    8200    2
Urman   7800    3
Sciarra 7700    3
Popp    6900    4

Optional cleanup

!rm ${hiveconf:FILENAME_ON_LOCAL_FS};
DROP TABLE query_to_setting_table;
selle
  • 868
  • 1
  • 10
  • 27