
I have some data in HDFS, and I need to access it using Python. Can anyone tell me how data is accessed from Hive using Python?

subina mohanan

3 Answers


To access Hive with PyHive, install these libraries:

pip install sasl
pip install thrift
pip install thrift-sasl
pip install PyHive

If you're on Linux, you may need to install SASL separately before running the above: install the package libsasl2-dev using apt-get, yum, or whatever your package manager is. For Windows there are some options on GNU.org. On a Mac, SASL should be available if you've installed the Xcode developer tools (xcode-select --install).
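
For reference, these are the system packages that typically provide the SASL headers (package names can differ by distribution, so treat this as a sketch):

sudo apt-get install libsasl2-dev     # Debian/Ubuntu
sudo yum install cyrus-sasl-devel     # RHEL/CentOS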

After installation, you can connect to Hive like this:

from pyhive import hive
conn = hive.Connection(host="YOUR_HIVE_HOST", port=PORT, username="YOU")
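
Depending on your cluster you may need a couple more connection arguments; a minimal sketch, assuming a recent PyHive version (the values are placeholders, and auth/password are only needed if your HiveServer2 requires them):

from pyhive import hive

conn = hive.Connection(
    host="YOUR_HIVE_HOST",
    port=10000,              # HiveServer2's default port
    username="YOU",
    database="default",      # which Hive database to use
    # auth="LDAP", password="YOUR_PASSWORD",   # uncomment if your server uses LDAP auth
)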

Now that you have the Hive connection, you have options for how to use it. You can just run a query directly:

cursor = conn.cursor()
cursor.execute("SELECT cool_stuff FROM hive_table")
for result in cursor.fetchall():
  use_result(result)  # use_result is a placeholder for whatever you do with each row

...or use the connection to build a pandas DataFrame:

import pandas as pd
df = pd.read_sql("SELECT cool_stuff FROM hive_table", conn)
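
Either way, the PyHive connection follows the Python DB-API, so you can inspect the result and close the connection when you're done; for example:

print(df.head())   # peek at the first few rows
conn.close()       # release the connection to HiveServer2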
Tristan Reid
  • "Could not start SASL: %s" % self.sasl.getError() under Windows 2008 R2, Python 3.6. How do I solve this problem? – Nikolay Baranenko Feb 04 '18 at 22:00
  • This is a Thrift issue; I was facing the same one. On a Linux environment it works fine. – Abdul Majeed May 13 '18 at 08:27
  • Confirming the Windows issue, and that it does work on Linux. I use Windows 10 for development (PyCharm). I could not "pip install pyhive[hive]" there; the dependencies fail. But if you just "pip install pyhive" (or whatever, using PyCharm's package manager), then you can develop against it. When I import PyHive I wrap it in an if os.name != "nt": statement so it is skipped on Windows (see the sketch below). Everything installs and works fine on Linux though (CentOS 7). – John Humphreys Dec 11 '18 at 14:25
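
For reference, a minimal sketch of the guarded import described in the last comment, so a shared codebase still loads on Windows (host, port, and username are placeholders):

import os

# Only import PyHive where it actually installs (i.e. not on Windows).
if os.name != "nt":
    from pyhive import hive
    conn = hive.Connection(host="YOUR_HIVE_HOST", port=10000, username="YOU")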

You can use the hive library to access Hive from Python; for that, you import ThriftHive: from hive import ThriftHive

Below is an example:

import sys

from hive import ThriftHive
from hive.ttypes import HiveServerException

from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

try:
  # Connect to the Hive Thrift service (HiveServer1, default port 10000)
  transport = TSocket.TSocket('localhost', 10000)
  transport = TTransport.TBufferedTransport(transport)
  protocol = TBinaryProtocol.TBinaryProtocol(transport)
  client = ThriftHive.Client(protocol)
  transport.open()

  client.execute("CREATE TABLE r(a STRING, b INT, c DOUBLE)")
  client.execute("LOAD DATA LOCAL INPATH '/path' INTO TABLE r")

  # Fetch the results one row at a time
  client.execute("SELECT * FROM r")
  while True:
    row = client.fetchOne()
    if row is None:
      break
    print(row)

  # ...or fetch them all at once
  client.execute("SELECT * FROM r")
  print(client.fetchAll())

  transport.close()
except Thrift.TException as tx:
  print('%s' % tx.message)
Sreejith
  • Please tell me how to get the hive library. – subina mohanan Jul 18 '13 at 12:05
  • You can get that library from $HIVE_HOME/lib/py/*: copy the contents of the py folder and paste them into your Python library path. – Sreejith Jul 18 '13 at 12:07
  • @Sreejith I have no problem importing those Python libraries; however, the code hangs after executing a Hive command, and it turned out to be a general problem. Was your code connecting to HiveServer1 or HiveServer2? https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/lCSuh6vLmHM – B.Mr.W. Nov 06 '13 at 17:30

A much simpler solution if you're on Windows uses pyodbc:

import pyodbc
import pandas as pd

# connect odbc to data source name
conn = pyodbc.connect("DSN=<your_dsn>", autocommit=True)

# read data into dataframe
hive_df = pd.read_sql("SELECT * FROM <table_name>", conn)

As long as you have an ODBC driver and a DSN, that's all you need.
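
If you'd rather not configure a DSN, pyodbc also accepts a full connection string; a hedged sketch, assuming the Cloudera Hive ODBC driver (the driver name and keywords depend on whichever Hive ODBC driver you actually installed):

import pyodbc
import pandas as pd

# DSN-less connection; AuthMech=3 means username/password in the Cloudera driver
conn = pyodbc.connect(
    "DRIVER={Cloudera ODBC Driver for Apache Hive};"
    "HOST=your_hive_host;PORT=10000;AuthMech=3;UID=you;PWD=your_password",
    autocommit=True)

hive_df = pd.read_sql("SELECT * FROM <table_name>", conn)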

Jared Wilber
  • It's not that simple on Linux :-/ – Samson Scharfrichter Dec 19 '18 at 11:41
  • @SamsonScharfrichter Good point; I'll add a note specifying that my solution is for Windows. – Jared Wilber Dec 19 '18 at 16:59
  • It's _doable_ on Linux, just not trivial (check that the installed version of `unixODBC` is recent enough, check that your `$LD_LIBRARY_PATH` taps its libraries, install the appropriate ODBC driver, reference that driver in `odbcinst.ini`, define a DSN in `odbc.ini`, troubleshoot). – Samson Scharfrichter Dec 19 '18 at 17:40