I am using PySpark to connect to Hive and fetch some data. The issue is that every row comes back with the column names as its values: the column names themselves are correct, but all of the row values are wrong.
Here is my code:
hive_jar_path = "C:/Users/shakir/Downloads/ClouderaHiveJDBC-2.6.11.1014/ClouderaHiveJDBC-2.6.11.1014/ClouderaHiveJDBC42-2.6.11.1014/HiveJDBC42.jar"
print(hive_jar_path)
print("")
import os
os.environ["HADOOP_HOME"] = "c:/users/shakir/downloads/spark/spark/spark"
os.environ["SPARK_HOME"] = "c:/users/shakir/downloads/spark/spark/spark"

import findspark
findspark.init()
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.sql import SparkSession
import uuid
spark = (
    SparkSession.builder
    .appName("Python Spark SQL Hive integration example")
    .config("spark.sql.warehouse.dir", "hdfs://...../user/hive/warehouse/..../....")
    .config("spark.driver.extraClassPath", hive_jar_path)
    .config("spark.sql.hive.llap", "true")
    .enableHiveSupport()
    .getOrCreate()
)
import databricks.koalas as ks
print("Reading Data from Hive . . .")
options = {
    "fetchsize": 1000,
    "inferSchema": True,
    "fileFormat": "orc",
    "inputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
    "outputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
    "driver": "org.apache.hive.jdbc.HiveDriver",
}
df = ks.read_sql("SELECT * FROM PERSONS LIMIT 3", connection_string, **options)
print("Done")
print(df)
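For context, connection_string is my Hive JDBC URL. I have redacted the real host, port, and database, but it has the usual jdbc:hive2 shape, roughly like this (placeholders, not my real values):

# Placeholder only -- real host/port/database redacted:
connection_string = "jdbc:hive2://<host>:<port>/<database>"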
Output of the code:
+------+-----+---------+
| Name | Age | Address |
+------+-----+---------+
| Name | Age | Address |
| Name | Age | Address |
| Name | Age | Address |
+------+-----+---------+
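What I expected was the actual row values from the PERSONS table, something like this (values invented just to illustrate the shape):

+-------+-----+------------+
| Name  | Age | Address    |
+-------+-----+------------+
| Alice | 34  | 12 Foo St  |
| Bob   | 28  | 7 Bar Ave  |
| Carol | 41  | 3 Baz Road |
+-------+-----+------------+

Why is every value replaced by its column name, and how do I get the real data back?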