
I am trying to write a Spark job in Python that opens a JDBC connection to Impala and loads a VIEW directly from Impala into a DataFrame. This question is close, but in Scala: Calling JDBC to impala/hive from within a spark job and creating a table

How do I do this? There are plenty of examples for other data sources such as MySQL, PostgreSQL, etc., but I haven't seen one for Impala + Python + Kerberos. An example would be of great help. Thank you!

I tried the following, based on information from the web, but it didn't work.

SPARK Notebook

#!/bin/bash
export PYSPARK_PYTHON=/home/anave/anaconda2/bin/python
export HADOOP_CONF_DIR=/etc/hive/conf
export PYSPARK_DRIVER_PYTHON=/home/anave/anaconda2/bin/ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=* --no-browser'

# use Java8
export JAVA_HOME=/usr/java/latest
export PATH=$JAVA_HOME/bin:$PATH

# JDBC Drivers for Impala
export CLASSPATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30/*.jar:$CLASSPATH
export JDBC_PATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30

# --jars $SRCDIR/spark-csv-assembly-1.4.0-SNAPSHOT.jar \
# --conf spark.sql.parquet.binaryAsString=true \
# --conf spark.sql.hive.convertMetastoreParquet=false

pyspark --master yarn-client \
        --driver-memory 4G \
        --executor-memory 2G \
        # --num-executors 10 \
        --jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar
        --driver-class-path $JDBC_PATH/*.jar

Python Code

properties = {
    "driver": "com.cloudera.impala.jdbc41.Driver",
    "AuthMech": "1",
#     "KrbRealm": "EXAMPLE.COM",
#     "KrbHostFQDN": "impala.example.com",
    "KrbServiceName": "impala"
}

# imp_env is the hostname of the DB; it works with other Impala queries run inside Python
url = "jdbc:impala:imp_env;auth=noSasl"

db_df = sqlContext.read.jdbc(url=url, table='summary', properties=properties)

I received this error message (Full Error Log):
Py4JJavaError: An error occurred while calling o42.jdbc. : java.lang.ClassNotFoundException: com.cloudera.impala.jdbc41.Driver

– alfredox
  • OK, for starters, `--jars` [should be a comma-separated list](http://stackoverflow.com/q/33961699/1560062). For Java 8 you should probably look for JDBC 4.2. There could be something else I am missing here. – zero323 Sep 09 '16 at 17:01
  • @alfredox: please check my answer and let me know whether it works. – Ram Ghadiyaram Sep 22 '16 at 02:48
  • @RamPrasadG: I just found out that my company does not have those drivers available, and they in fact have not configured our internal environments to even give access through JDBC, so unfortunately I won't be able to test it anymore. As it turns out, they did however configure Spark to read directly from Impala by simply doing `df = sqlContext.sql(query)`. – alfredox Sep 22 '16 at 20:55

4 Answers


You can use

--jars $(echo /dir/of/jars/*.jar | tr ' ' ',') 

instead of

--jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar

or, for another approach, please see my answer.
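
Put together, a sketch of the corrected launch command, assuming the driver jars live under $JDBC_PATH as in the question (note that --jars takes a comma-separated list, while --driver-class-path uses the JVM path separator, a colon):

pyspark --master yarn-client \
        --driver-memory 4G \
        --executor-memory 2G \
        --jars $(echo /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar | tr ' ' ',') \
        --driver-class-path $(echo $JDBC_PATH/*.jar | tr ' ' ':')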

– Ram Ghadiyaram

The first approach is to run spark-submit on the impala_jdbc_connection.py script below:

spark-submit --driver-class-path /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/ImpalaJDBC41.jar --jars /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/ImpalaJDBC41.jar impala_jdbc_connection.py

impala_jdbc_connection.py

properties = {
    "driver": "com.cloudera.impala.jdbc41.Driver"
}

from pyspark.sql import SparkSession

# initialize the Spark session
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "jar-packages-list")
    .config("spark.sql.warehouse.dir", "hdfs://dwh-hdp-node01.dev.ergo.liferunoffinsuranceplatform.com:8020/user/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()
)

db_df = spark.read.jdbc(url='jdbc:impala://host_ip_address:21050/database_name', table='table_name', properties=properties)

db_df.show()
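
Since the question specifically asks about Kerberos, here is a hedged sketch of the same read with the Kerberos properties from the question folded into the JDBC URL (AuthMech=1, KrbRealm, KrbHostFQDN, KrbServiceName; the hostname, realm, database, and table are placeholders):

url = ("jdbc:impala://impala.example.com:21050/default;"
       "AuthMech=1;KrbRealm=EXAMPLE.COM;"
       "KrbHostFQDN=impala.example.com;KrbServiceName=impala")
properties = {"driver": "com.cloudera.impala.jdbc41.Driver"}
db_df = spark.read.jdbc(url=url, table='table_name', properties=properties)

This also assumes a valid Kerberos ticket is available to the driver and executors (e.g. via kinit or a keytab).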

The second approach is not a direct import from Impala into Spark, but rather fetching the results through a client library and converting them to a Spark DataFrame (see the sketch after the snippet below).

Install impyla (`pip install impyla`). Source: https://github.com/cloudera/impyla

Connect to Impala, fetch the results from the Impala database, and convert the result to a Spark DataFrame:

from impala.dbapi import connect

conn = connect(host='IP_ADDRESS_OF_HOST', port=21050)
cursor = conn.cursor()
cursor.execute('select * from database.table')
res = cursor.fetchall()  # a list of tuples; see the conversion sketch below

for data in res:
    print(data)
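
A minimal sketch of that conversion, assuming a SparkSession is available and taking the column names from the DB-API cursor:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
columns = [desc[0] for desc in cursor.description]  # column names from the result set
spark_df = spark.createDataFrame(res, schema=columns)
spark_df.show()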
– user1

I did this in an Azure Databricks notebook after setting up the jar in the cluster libraries. I generally followed the previous post, except that the 'D' is uppercase in the Driver config. It worked great.

properties = {
    "Driver": "com.cloudera.impala.jdbc41.Driver"
}


db_df = spark.read.jdbc(url='jdbc:impala://hostname.domain.net:21050/dbname;AuthMech=3;UID=xxxx;PWD=xxxx', table='product', properties=properties)

db_df.show()

This works for me:

spark-shell --driver-class-path ImpalaJDBC41.jar --jars ImpalaJDBC41.jar

val jdbcURL = s"jdbc:impala://192.168.56.101:21050;AuthMech=0"
val connectionProperties = new java.util.Properties()
val hbaseDF = sqlContext.read.jdbc(jdbcURL, "impala_table", connectionProperties)
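
For completeness, since the question asks for Python, a hedged PySpark equivalent of the same snippet, under the same assumptions (driver jar passed via --driver-class-path and --jars, AuthMech=0, placeholder host and table):

jdbc_url = "jdbc:impala://192.168.56.101:21050;AuthMech=0"
hbase_df = sqlContext.read.jdbc(url=jdbc_url, table="impala_table", properties={})
hbase_df.show()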