I am using PySpark to connect to Hive and fetch some data. The issue is that every row comes back with the column names as its values: the column names themselves are correct, but all of the row values are wrong.
Here is my code:
hive_jar_path = "C:/Users/shakir/Downloads/ClouderaHiveJDBC-2.6.11.1014/ClouderaHiveJDBC-2.6.11.1014/ClouderaHiveJDBC42-2.6.11.1014/HiveJDBC42.jar"
print(hive_jar_path)
print("")
import os
os.environ["HADOOP_HOME"] = "c:/users/shakir/downloads/spark/spark/spark"
os.environ["SPARK_HOME"] = "c:/users/shakir/downloads/spark/spark/spark"

import findspark
findspark.init()
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.sql import SparkSession
import uuid
spark = (
    SparkSession.builder
    .appName("Python Spark SQL Hive integration example")
    .config("spark.sql.warehouse.dir", "hdfs://...../user/hive/warehouse/..../....")
    .config("spark.driver.extraClassPath", hive_jar_path)
    .config("spark.sql.hive.llap", "true")
    .enableHiveSupport()
    .getOrCreate()
)
import databricks.koalas as ks
print("Reading Data from Hive . . .")
options = {
    "fetchsize": 1000,
    "inferSchema": True,
    "fileFormat": "orc",
    "inputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
    "outputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
    "driver": "org.apache.hive.jdbc.HiveDriver",
}
df = ks.read_sql("SELECT * FROM PERSONS LIMIT 3", connection_string, **options)
print("Done")
print(df)
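For context, connection_string is my Hive JDBC URL. I have redacted the real host, port, and database, but it has the usual jdbc:hive2 shape, roughly like this (placeholders, not my real values):

# Placeholder only -- real host/port/database redacted:
connection_string = "jdbc:hive2://<host>:<port>/<database>"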
Output of the code:
+------+-----+---------+
| Name | Age | Address |
+------+-----+---------+
| Name | Age | Address |
| Name | Age | Address |
| Name | Age | Address |
+------+-----+---------+
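What I expected was the actual row values from the PERSONS table, something like this (values invented just to illustrate the shape):

+-------+-----+------------+
| Name  | Age | Address    |
+-------+-----+------------+
| Alice | 34  | 12 Foo St  |
| Bob   | 28  | 7 Bar Ave  |
| Carol | 41  | 3 Baz Road |
+-------+-----+------------+

Why is every value replaced by its column name, and how do I get the real data back?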