I'm trying to set up some Views in Hive that will take a date as a dynamic parameter. In my working below, I've swapped to using the hiveconf variable in the Select clause, so we can see what's going on, but the principle remains the same
According to this and this , I should be able to include a statement in " ${hiveconf:dateRangeFrom}" in my Create View statement, supply the hiveconf:dateRangeFrom variable at run time for maximum happiness, but this is just not happening - Hive appears to be using whatever value is assigned to the variable when the View is created and hard-coding it into the View definition, not substituting it at runtime as you might expect.
I've got a workaround, whereby I supply a parameter to a sql file that then creates all the views, substituting the desired value in, but this isn't sustainable
All the working is below, so you can see how I came to this conclusion. Any ideas?
1) Supply a hiveconf value to a simple query
(needs to be a date for the ultimate query)
hive -e "Select ${hiveconf:dateRangeFrom} , unix_timestamp(${hiveconf:dateRangeFrom} , 'yyyy-MM-dd');" --hiveconf dateRangeFrom='2014-01-01'
The date will be returned as suppled, and converted to a unix timestamp (eg "2014-01-01"=1388534400, "2014-09-12"=41047640). The script can be be run repeatedly with results changing accordingly with the parameter.
2) Create a view that returns this data
CREATE VIEW get_date AS
SELECT ${hiveconf:dateRangeFrom}, unix_timestamp(${hiveconf:dateRangeFrom} , 'yyyy-MM-dd');
This returns the error:
FAILED: ParseException line 2:8 cannot recognize input near '$' '{' 'hivevar' in select clause
Presumably because it is trying to do a replacement, but the ${hivevar:dateRangeFrom} variable has not been initialized at this point
According to: Creating Views in Hive with parameter and http://mail-archives.apache.org/mod_mbox/hive-user/201205.mbox/%3CBAY151-W9BC976D584FD172E7D70BC0160@phx.gbl%3E
Then variables can be used in Hive views, long as quotes are used around them:
CREATE VIEW get_date AS
SELECT "${hiveconf:dateRangeFrom}", unix_timestamp("${hiveconf:dateRangeFrom}" , 'yyyy-MM-dd');
This allows the view to be created, so trying to call the view using a parameter:
hive -e "Select * from get_date" --hiveconf dateRangeFrom='2014-01-01'
just returns the variable name:
${hiveconf:dateRangeFrom} NULL
Time taken: 20.614 seconds, Fetched: 1 row(s)
Using single quotes instead:
DROP VIEW get_date;
CREATE VIEW get_date AS
SELECT '${hiveconf:dateRangeFrom}', unix_timestamp('${hiveconf:dateRangeFrom} ', 'yyyy-MM-dd');
Gives the same result, just the variable name.
3) Create a view in an interactive session with the variable already set
SET hiveconf:dateRangeFrom="2014-02-01";
Rebuild the original view, with the variables without quotes
DROP VIEW get_date;
CREATE VIEW get_date AS
SELECT ${hiveconf:dateRangeFrom}, unix_timestamp(${hiveconf:dateRangeFrom} , 'yyyy-MM-dd');
Then calling "select * from get_date;" from within the session gives the expected result.
As does calling from the command line, with the same parameter value:
hive -e "Select * from get_date;" --hiveconf dateRangeFrom='2014-02-01'
However, if we call the view with a different parameter, then we still get the original answer:
hive -e "Select * from get_date;" --hiveconf dateRangeFrom='2014-09-12'
2014-02-01 1391212800
Time taken: 24.773 seconds, Fetched: 1 row(s)
If we set the variable inside a new session:
SET hiveconf:dateRangeFrom="2014-06-01";
or even not set it all, we still get the same result
Looking at the extended view definition, the reason is obvious:
hive> describe extended get_date;
OK
_c0 string
_c1 bigint
Detailed Table Information Table(tableName:get_date, dbName:default, owner:
36015to, createTime:1410523149, lastAccessTime:0, retention:0, sd:StorageDescrip
tor(cols:[FieldSchema(name:_c0, type:string, comment:null), FieldSchema(name:_c1
, type:bigint, comment:null)], location:null, inputFormat:org.apache.hadoop.mapr
ed.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequen
ceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:nu
ll, serializationLib:null, parameters:{}), bucketCols:[], sortCols:[], parameter
s:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValu
eLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{
transient_lastDdlTime=1410523149}, ***viewOriginalText:SELECT "2014-02-01", unix_t
imestamp("2014-02-01" , 'yyyy-MM-dd'), viewExpandedText:SELECT "2014-02-01", un
ix_timestamp("2014-02-01" , 'yyyy-MM-dd')***, tableType:VIRTUAL_VIEW)
Time taken: 0.123 seconds, Fetched: 4 row(s)
The variable substitution took place when the view was created, and hard-coded that date into the definition:
viewOriginalText:SELECT "2014-02-01", unix_t imestamp("2014-02-01" , 'yyyy-MM-dd'), viewExpandedText:SELECT "2014-02-01", un ix_timestamp("2014-02-01" , 'yyyy-MM-dd')
4) Switch off variable subsitution
Hive is clearly putting in the current value of the variable at run-time, so I tried switching it off and recreating the query:
hive> set hive.variable.substitute;
hive.variable.substitute=true
hive> set hive.variable.substitute = false;
hive> set hive.variable.substitute;
hive.variable.substitute=false
The Create View statement still fails with the same error:
FAILED: ParseException line 2:8 cannot recognize input near '$' '{' 'hiveconf' in select clause
5) Workaround
If we create a sql file that creates the views, testParam.sql, we can sort of get around the problem:
DROP VIEW get_date;
CREATE VIEW get_date AS
SELECT ${hivevar:dateRangeFrom}, unix_timestamp(${hivevar:dateRangeFrom} , 'yyyy-MM-dd');
SELECT * FROM get_date;
Calling that from the command line gives the expected results:
hive -f testParam.sql --hiveconf dateRangeFrom='2014-08-01'
2014-08-01 1406847600
Time taken: 20.763 seconds, Fetched: 1 row(s)
hive -f testParam.sql --hiveconf dateRangeFrom='2014-09-12'
2014-09-12 1410476400
Time taken: 19.74 seconds, Fetched: 1 row(s)
This does work, and will be fine for now, but is hardly ideal for a distributed, multi-user environment. Looking at the view meta-data, we can see that the view is always destroyed and rebuilt with the latest parameters:
transient_lastDdlTime=1410525287}, viewOriginalText:SELECT '2014-09-12', unix_timestamp('2014-09-12' , 'yyyy-MM-dd'), viewExpandedText:SELECT '2014-09-12', unix_timestamp('2014-09-12' , 'yyyy-MM-dd'), tableType:VIRTUAL_VIEW)
So, how to create a view that can be supplied with dynamic parameters at runtime without constantly rebuilding it