
While connecting to HiveServer2 using Python with the code below:

import pyhs2

with pyhs2.connect(host='localhost',
                   port=10000,
                   authMechanism="PLAIN",
                   user='root',
                   password='test',
                   database='default') as conn:
    with conn.cursor() as cur:
        # Show databases
        print(cur.getDatabases())

        # Execute query
        cur.execute("select * from table")

        # Return column info from query
        print(cur.getSchema())

        # Fetch table results
        for i in cur.fetch():
            print(i)

I am getting the error below:

File "C:\Users\vinbhask\AppData\Roaming\Python\Python36\site-packages\pyhs2-0.6.0-py3.6.egg\pyhs2\connections.py", line 7, in <module>
    from cloudera.thrift_sasl import TSaslClientTransport
ModuleNotFoundError: No module named 'cloudera'

I have tried the suggestions here and here, but the issue wasn't resolved.

Here are the packages installed so far:

bitarray 0.8.1, certifi 2017.7.27.1, chardet 3.0.4, cm-api 16.0.0, cx-Oracle 6.0.1, future 0.16.0, idna 2.6, impyla 0.14.0, JayDeBeApi 1.1.1, JPype1 0.6.2, ply 3.10, pure-sasl 0.4.0, PyHive 0.4.0, pyhs2 0.6.0, pyodbc 4.0.17, requests 2.18.4, sasl 0.2.1, six 1.10.0, teradata 15.10.0.21, thrift 0.10.0, thrift-sasl 0.2.1, thriftpy 0.3.9, urllib3 1.22

Error while using Impyla:

Traceback (most recent call last):
  File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36-32\Scripts\HiveConnTester4.py", line 1, in <module>
    from impala.dbapi import connect
  File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36-32\lib\site-packages\impala\dbapi.py", line 28, in <module>
    import impala.hiveserver2 as hs2
  File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36-32\lib\site-packages\impala\hiveserver2.py", line 33, in <module>
    from impala._thrift_api import (
  File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36-32\lib\site-packages\impala\_thrift_api.py", line 74, in <module>
    include_dirs=[thrift_dir])
  File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36-32\lib\site-packages\thriftpy\parser\__init__.py", line 30, in load
    include_dir=include_dir)
  File "C:\Users\xxxxx\AppData\Local\Programs\Python\Python36-32\lib\site-packages\thriftpy\parser\parser.py", line 496, in parse
    url_scheme))
thriftpy.parser.exc.ThriftParserError: ThriftPy does not support generating module with path in protocol 'c'
jmattheis
Vinod
  • I'm amazed that so many people complain suddenly about PyHive (which is currently broken *[Aug 2017]*) and PyHS2 (which you clearly cannot make work). Try ImPyla instead. It's maintained by Cloudera. And it works. – Samson Scharfrichter Aug 30 '17 at 09:49
  • @SamsonScharfrichter: I had tried in Impyla as well, updated the error log for that as above – Vinod Aug 30 '17 at 17:08
  • How about PySpark? – OneCricketeer Sep 02 '17 at 07:27
  • Looks like you have a rogue dependency in your Python stack that plays hell with all DB drivers... I suggest that you make a separate, clean Python install, but with Anaconda; then a clean install of Impyla (with the Anaconda installer, cf. https://anaconda.org/conda-forge/impyla). If that one works, then you will know for sure that your current Python install is to blame. – Samson Scharfrichter Sep 04 '17 at 20:38
  • @SamsonScharfrichter: Done as advised and below is the error: File "C:\Users\xxxx\AppData\Local\Continuum\Anaconda3\envs\QA_DataValidator\lib\site-packages\thriftpy\protocol\binary.py", line 178, in read_message_begin message='No protocol version header') thriftpy.protocol.exc.TProtocolException: TProtocolException(type=4) – Vinod Sep 06 '17 at 06:02
  • @cricket_007: Can we connect to remote HIVE2 server from our local system using pyspark? – Vinod Sep 06 '17 at 06:06
  • 1
    Back to the basics: (1) which version of Hive are you running server-side and (2) does anyone else connect successfully to that thing? – Samson Scharfrichter Sep 06 '17 at 11:04
  • Yes, of course you can. Assuming the firewall allows you to – OneCricketeer Sep 06 '17 at 12:27
  • @SamsonScharfrichter 1: Hive 2.1.1-mapr-1703-r1. 2: No one else; I am the first person trying some automation scripts. Earlier I had connected to the same server via the Java JDBC APIs, using a Java program in Eclipse. – Vinod Sep 07 '17 at 03:54

3 Answers


thrift_sasl.py tries to import cStringIO, which was removed in Python 3. Try with Python 2?
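For reference, this is the usual kind of compatibility shim for that import (a sketch of the sort of patch thrift_sasl would need, not the library's actual code; `BufferIO` is a name chosen here for illustration):

```python
# cStringIO was removed in Python 3; the io module provides the replacement.
try:
    from cStringIO import StringIO as BufferIO   # Python 2
except ImportError:
    from io import BytesIO as BufferIO           # Python 3

# SASL transports buffer raw bytes, so BytesIO is the right substitute.
buf = BufferIO()
buf.write(b"sasl frame")
assert buf.getvalue() == b"sasl frame"
```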

Xire

You may need to install an unreleased version of thrift_sasl. Try:

pip install git+https://github.com/cloudera/thrift_sasl
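Once that installs, the connection with Impyla might look like the sketch below (untested here; host and credentials are the ones from the question, and `fetch_rows` is a helper name made up for this example):

```python
def fetch_rows(query, host='localhost', port=10000,
               user='root', password='test', database='default'):
    # Imported inside the function so the sketch can be read (and imported)
    # without impyla installed.
    from impala.dbapi import connect

    conn = connect(host=host, port=port, auth_mechanism='PLAIN',
                   user=user, password=password, database=database)
    try:
        cur = conn.cursor()
        cur.execute(query)
        return cur.fetchall()
    finally:
        conn.close()

# Usage (placeholder table name):
# for row in fetch_rows('SELECT * FROM some_table LIMIT 10'):
#     print(row)
```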
Tagar
  • @Vinod was this helpful? – Tagar Sep 06 '17 at 03:20
  • No, I am getting this error: "Failed to connect to github.com port 443: Timed out" – Vinod Sep 06 '17 at 05:59
  • The last error hints you're behind a firewall - that's why you're getting a timeout on accessing port 443. Change `https:` to `http:` and try again - port 80 might be open. – Tagar Sep 06 '17 at 15:06
  • Getting this error after changing to http "fatal: unable to access 'http://github.com/cloudera/thrift_sasl/': Failed to connect to github.com port 80: Timed out Command "git clone -q http://github.com/cloudera/thrift_sasl C:\Users\xxxx\AppData\Local\Temp\pip-5_9y1ynj-build" failed with error code 128 in None" – Vinod Sep 07 '17 at 04:40

If you're comfortable learning PySpark, then you just need to set the hive.metastore.uris property to point at the Hive Metastore address, and you're ready to go.

The easiest way to do that would be to export the hive-site.xml from your cluster, then pass --files hive-site.xml during spark-submit.

(I haven't tried running standalone Pyspark, so YMMV)
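As a starting point, a minimal sketch (untested against a real cluster; the metastore host name is a placeholder, and it assumes Spark 2.x built with Hive support):

```python
def metastore_uri(host, port=9083):
    # 9083 is the conventional Hive Metastore thrift port.
    return "thrift://{}:{}".format(host, port)

def hive_session(metastore_host):
    # Imported here so the sketch can be read without pyspark installed.
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .appName("hive-metastore-test")
            .config("hive.metastore.uris", metastore_uri(metastore_host))
            .enableHiveSupport()
            .getOrCreate())

# Usage (placeholder host):
# spark = hive_session("metastore-host")
# spark.sql("SHOW DATABASES").show()
```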

OneCricketeer