
While connecting to HiveServer2 using Python with the code below:

import pyhs2

with pyhs2.connect(host='localhost',
                   port=10000,
                   authMechanism="PLAIN",
                   user='root',
                   password='test',
                   database='default') as conn:
    with conn.cursor() as cur:
        # Show databases
        print(cur.getDatabases())

        # Execute query
        cur.execute("select * from table")

        # Return column info from query
        print(cur.getSchema())

        # Fetch table results
        for i in cur.fetch():
            print(i)

I am getting the error below:

File "C:\Users\vinbhask\AppData\Roaming\Python\Python36\site-packages\pyhs2-0.6.0-py3.6.egg\pyhs2\connections.py", line 7, in <module>
    from cloudera.thrift_sasl import TSaslClientTransport
ModuleNotFoundError: No module named 'cloudera'

I have tried the suggestions here and here, but the issue wasn't resolved.

Here are the packages installed so far:

bitarray 0.8.1, certifi 2017.7.27.1, chardet 3.0.4, cm-api 16.0.0, cx-Oracle 6.0.1, future 0.16.0, idna 2.6, impyla 0.14.0, JayDeBeApi 1.1.1, JPype1 0.6.2, ply 3.10, pure-sasl 0.4.0, PyHive 0.4.0, pyhs2 0.6.0, pyodbc 4.0.17, requests 2.18.4, sasl 0.2.1, six 1.10.0, teradata 15.10.0.21, thrift 0.10.0, thrift-sasl 0.2.1, thriftpy 0.3.9, urllib3 1.22

Error while using Impyla:

Traceback (most recent call last):
  File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36-32\Scripts\HiveConnTester4.py", line 1, in <module>
    from impala.dbapi import connect
  File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36-32\lib\site-packages\impala\dbapi.py", line 28, in <module>
    import impala.hiveserver2 as hs2
  File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36-32\lib\site-packages\impala\hiveserver2.py", line 33, in <module>
    from impala._thrift_api import (
  File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36-32\lib\site-packages\impala\_thrift_api.py", line 74, in <module>
    include_dirs=[thrift_dir])
  File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36-32\lib\site-packages\thriftpy\parser\__init__.py", line 30, in load
    include_dir=include_dir)
  File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36-32\lib\site-packages\thriftpy\parser\parser.py", line 496, in parse
    url_scheme))
thriftpy.parser.exc.ThriftParserError: ThriftPy does not support generating module with path in protocol 'c'
jmattheis
Vinod
  • I'm amazed that so many people complain suddenly about PyHive (which is currently broken *[Aug 2017]*) and PyHS2 (which you clearly cannot make work). Try ImPyla instead. It's maintained by Cloudera. And it works. – Samson Scharfrichter Aug 30 '17 at 09:49
  • @SamsonScharfrichter: I had tried in Impyla as well, updated the error log for that as above – Vinod Aug 30 '17 at 17:08
  • How about PySpark? – OneCricketeer Sep 02 '17 at 07:27
  • Looks like you have a rogue dependency in your Python stack that plays hell with all DB drivers... I suggest that you make a separate, clean Python install, but with Anaconda; then a clean install of Impyla (with the Anaconda installer, cf. https://anaconda.org/conda-forge/impyla). If that one works, then you will know for sure that your current Python install is to blame. – Samson Scharfrichter Sep 04 '17 at 20:38
  • @SamsonScharfrichter: Done as advised and below is the error: File "C:\Users\xxxx\AppData\Local\Continuum\Anaconda3\envs\QA_DataValidator\lib\site-packages\thriftpy\protocol\binary.py", line 178, in read_message_begin message='No protocol version header') thriftpy.protocol.exc.TProtocolException: TProtocolException(type=4) – Vinod Sep 06 '17 at 06:02
  • @cricket_007: Can we connect to remote HIVE2 server from our local system using pyspark? – Vinod Sep 06 '17 at 06:06
  • 1
    Back to the basics: (1) which version of Hive are you running server-side and (2) does anyone else connect successfully to that thing? – Samson Scharfrichter Sep 06 '17 at 11:04
  • Yes, of course you can. Assuming the firewall allows you to – OneCricketeer Sep 06 '17 at 12:27
  • @SamsonScharfrichter 1: Hive 2.1.1-mapr-1703-r1. 2: No one else; I am the first person trying some automation scripts. Earlier I had connected to the same server via the Java JDBC APIs, using a Java program in Eclipse. – Vinod Sep 07 '17 at 03:54

3 Answers


thrift_sasl.py tries to import cStringIO, which was removed in Python 3. Try with Python 2?
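For reference, this is the usual kind of compatibility shim for that import (a sketch of the sort of patch thrift_sasl would need, not the library's actual code; `BufferIO` is a name chosen here for illustration):

```python
# cStringIO was removed in Python 3; the io module provides the replacement.
try:
    from cStringIO import StringIO as BufferIO   # Python 2
except ImportError:
    from io import BytesIO as BufferIO           # Python 3

# SASL transports buffer raw bytes, so BytesIO is the right substitute.
buf = BufferIO()
buf.write(b"sasl frame")
assert buf.getvalue() == b"sasl frame"
```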

Xire

You may need to install an unreleased version of thrift_sasl. Try:

pip install git+https://github.com/cloudera/thrift_sasl
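Once that installs, the connection with Impyla might look like the sketch below (untested here; host and credentials are the ones from the question, and `fetch_rows` is a helper name made up for this example):

```python
def fetch_rows(query, host='localhost', port=10000,
               user='root', password='test', database='default'):
    # Imported inside the function so the sketch can be read (and imported)
    # without impyla installed.
    from impala.dbapi import connect

    conn = connect(host=host, port=port, auth_mechanism='PLAIN',
                   user=user, password=password, database=database)
    try:
        cur = conn.cursor()
        cur.execute(query)
        return cur.fetchall()
    finally:
        conn.close()

# Usage (placeholder table name):
# for row in fetch_rows('SELECT * FROM some_table LIMIT 10'):
#     print(row)
```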
Tagar
  • @Vinod was this helpful? – Tagar Sep 06 '17 at 03:20
  • No, I am getting this error: "Failed to connect to github.com port 443: Timed out" – Vinod Sep 06 '17 at 05:59
  • The last error hints you're behind a firewall - that's why you're getting a timeout on accessing port 443. Change `https:` to `http:` and try again - port 80 might be open. – Tagar Sep 06 '17 at 15:06
  • Getting this error after changing to http "fatal: unable to access 'http://github.com/cloudera/thrift_sasl/': Failed to connect to github.com port 80: Timed out Command "git clone -q http://github.com/cloudera/thrift_sasl C:\Users\xxxx\AppData\Local\Temp\pip-5_9y1ynj-build" failed with error code 128 in None" – Vinod Sep 07 '17 at 04:40

If you're comfortable learning PySpark, then you just need to set the hive.metastore.uris property to point at the Hive Metastore address, and you're ready to go.

The easiest way to do that would be to export the hive-site.xml from your cluster, then pass --files hive-site.xml during spark-submit.

(I haven't tried running standalone Pyspark, so YMMV)
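As a starting point, a minimal sketch (untested against a real cluster; the metastore host name is a placeholder, and it assumes Spark 2.x built with Hive support):

```python
def metastore_uri(host, port=9083):
    # 9083 is the conventional Hive Metastore thrift port.
    return "thrift://{}:{}".format(host, port)

def hive_session(metastore_host):
    # Imported here so the sketch can be read without pyspark installed.
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .appName("hive-metastore-test")
            .config("hive.metastore.uris", metastore_uri(metastore_host))
            .enableHiveSupport()
            .getOrCreate())

# Usage (placeholder host):
# spark = hive_session("metastore-host")
# spark.sql("SHOW DATABASES").show()
```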

OneCricketeer