
I am trying to connect to my hive server from a local copy of Airflow, but it seems like the HiveCliHook is trying to connect to my local copy of Hive.

I'm running the following to test it:

import airflow
from airflow.models import Connection
from airflow.hooks.hive_hooks import HiveCliHook

usr = 'myusername'
pss = 'mypass'

# Point the default Hive CLI connection at the remote Hive server
session = airflow.settings.Session()
hive_cli = session.query(Connection).filter(Connection.conn_id == 'hive_cli_default').all()[0]

hive_cli.host = 'hive_server.test.mydomain.com'
hive_cli.port = '9083'
hive_cli.login = usr
hive_cli.password = pss
hive_cli.schema = 'default'

session.commit()

# Run a trivial query through the hook
hive = HiveCliHook()

hive.run_cli("select 1")

Which is throwing this error:

[2018-11-28 13:23:22,667] {base_hook.py:83} INFO - Using connection to: hive_server.test.mydomain.com
[2018-11-28 13:24:50,891] {hive_hooks.py:220} INFO - hive -f /tmp/airflow_hiveop_2Fdl2I/tmpBFoGp7  
[2018-11-28 13:24:55,548] {hive_hooks.py:235} INFO - Logging initialized using configuration in jar:file:/usr/local/apache-hive-2.3.4-bin/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true  
[2018-11-28 13:25:01,776] {hive_hooks.py:235} INFO - FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

Does anyone have any idea where I'm going wrong?

user3434523
  • Were you able to figure this out? My guess is that just [like every other `Airflow` `hook`](https://github.com/apache/incubator-airflow/blob/v1-10-stable/airflow/contrib/hooks/sqoop_hook.py#L35) (and `operator`), this one also works only with a **local `Hive` server** and must be used in tandem with [`SSHHook`](https://github.com/apache/incubator-airflow/blob/v1-10-stable/airflow/contrib/hooks/ssh_hook.py) in order to fire queries at a *remote* Hive server. – y2k-shubham Dec 12 '18 at 11:08
  • I'm a bit confused because the [docs](https://incubator-airflow.readthedocs.io/en/latest/configuration.html#scaling-out-with-celery) clearly say `..For example, if you use the HiveOperator, the hive CLI needs to be installed on that box..` However, looking at the [code](https://github.com/apache/incubator-airflow/blob/v1-10-stable/airflow/hooks/hive_hooks.py#L81), I don't see any reason why it wouldn't work for remote `Hive` servers – y2k-shubham Dec 12 '18 at 11:44

1 Answer

  • While you can use the HiveCliHook (unaltered) to connect to and run HQL statements against a remote Hive server, the one requirement is that the box running your Airflow worker must also have the Hive binaries installed

  • This is because the hive CLI command prepared by HiveCliHook is run on the worker machine via good old bash. So if the Hive CLI is not installed on the machine where that command runs (i.e. your Airflow worker), it breaks, as in your case (see the simplified sketch after this list)
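
Roughly speaking, here is an illustrative sketch (not the actual HiveCliHook source) of what run_cli amounts to on the worker: the HQL is dumped to a temp file and the local hive binary is invoked on it, which is exactly the `hive -f /tmp/airflow_hiveop_.../...` line visible in your logs.

# Illustrative sketch only, NOT the real HiveCliHook implementation:
# run_cli effectively writes the HQL to a temp file and shells out to
# the local `hive` binary, which is why the worker box needs Hive installed.
import subprocess
import tempfile

def run_cli_sketch(hql):
    with tempfile.NamedTemporaryFile(mode='w', suffix='.hql') as f:
        f.write(hql)
        f.flush()
        # This call fails if `hive` is not on the worker's PATH
        return subprocess.check_output(['hive', '-f', f.name])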


A straightforward workaround is to implement your own RemoteHiveCliOperator that:

  • Creates an SSHHook to the remote Hive-server machine
  • Executes your HQL statement over that SSHHook, as in the sketch below
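
Here's a minimal sketch of such an operator, assuming you have configured an SSH connection to the Hive-server box in Airflow (the conn_id 'ssh_hive_server', the naive quoting and the error handling are illustrative, not something Airflow ships with):

# Minimal sketch of a custom operator that runs HQL on the remote
# Hive-server box over SSH. The conn_id 'ssh_hive_server' is assumed to
# be configured by you; quoting and error handling are deliberately naive.
from airflow.exceptions import AirflowException
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.contrib.hooks.ssh_hook import SSHHook


class RemoteHiveCliOperator(BaseOperator):

    @apply_defaults
    def __init__(self, hql, ssh_conn_id='ssh_hive_server', *args, **kwargs):
        super(RemoteHiveCliOperator, self).__init__(*args, **kwargs)
        self.hql = hql
        self.ssh_conn_id = ssh_conn_id

    def execute(self, context):
        # get_conn() returns a paramiko SSHClient connected to the remote box
        ssh_client = SSHHook(ssh_conn_id=self.ssh_conn_id).get_conn()
        try:
            command = 'hive -e "{}"'.format(self.hql.replace('"', '\\"'))
            stdin, stdout, stderr = ssh_client.exec_command(command)
            exit_status = stdout.channel.recv_exit_status()
            if exit_status != 0:
                raise AirflowException(
                    'Remote hive command failed: ' + stderr.read().decode())
            self.log.info(stdout.read().decode())
        finally:
            ssh_client.close()

You would then use it in a DAG like any other operator, e.g. RemoteHiveCliOperator(task_id='select_one', hql='select 1', dag=dag).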

In fact, this seems to be a universal drawback of almost all Airflow operators: by default, they expect the requisite packages to be installed on every worker. The docs warn about it:

For example, if you use the HiveOperator, the hive CLI needs to be installed on that box

y2k-shubham
  • Do note that a plausible risk associated with executing commands over `SSH` is **breaking of the connection** ([`SSHHook`](https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/hooks/ssh_hook.py#L145) uses [`paramiko`](https://docs.paramiko.org/en/2.4/)). While it's unlikely to bother you, if fault-tolerance is high on your priority list, you can [use `Emr-Steps` to execute `Hive` commands](https://stackoverflow.com/questions/32410325/boto3-emr-hive-step) – y2k-shubham Dec 12 '18 at 15:27
  • Do note that the `EMR-Steps` API has an inherent limitation of [sequential execution](https://stackoverflow.com/a/53156794/3679900) – y2k-shubham Dec 18 '18 at 09:08