I am trying to access HDFS files from a Spark cluster running inside Kubernetes containers.

However, I keep getting the error: AnalysisException: 'The ORC data source must be used with Hive support enabled;'

What am I missing here?

  • Before Spark 2.3 (I think) there was no built-in library for ORC; Spark used the Hive libraries -- i.e. a _full_ install of the Hive client libraries alongside the Spark install, and the appropriate classpath to reach those libs. Upgrade to v2.3 or 2.4 if you can... – Samson Scharfrichter Dec 15 '18 at 16:53
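On Spark 2.3+, ORC can be read without Hive support at all via the built-in reader, selected with the `spark.sql.orc.impl` setting (it defaults to `native` from 2.4 onward). A minimal sketch after upgrading, reusing the master URL quoted in this thread; the HDFS path is hypothetical, since the original one is elided:

    from pyspark.sql import SparkSession

    # Build the session in a single step; "native" selects Spark's
    # built-in ORC reader (2.3+), so no Hive support is required.
    spark = (SparkSession.builder
             .master("spark://10.41.0.5:7077")
             .appName("orc-native-test")
             .config("spark.sql.orc.impl", "native")
             .getOrCreate())

    # hypothetical path; the original is elided as hdfs://....
    df = spark.read.orc("hdfs://namenode:8020/path/to/data.orc")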

1 Answer

Have you created your SparkSession with enableHiveSupport()?

Similar issue: Spark can access Hive table from pyspark but not from spark-submit
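For reference, a minimal sketch of enabling Hive support through the builder (the master URL is the one from this thread; this assumes your Spark build includes Hive, i.e. the spark-hive jars are on the classpath). It must run before any SparkContext or SparkSession exists, because `spark.sql.catalogImplementation` is fixed once the context starts:

    from pyspark.sql import SparkSession

    # enableHiveSupport() sets spark.sql.catalogImplementation=hive;
    # it only takes effect when the session is created here, not when
    # attached to an already-running context.
    spark = (SparkSession.builder
             .master("spark://10.41.0.5:7077")
             .appName("test")
             .enableHiveSupport()
             .getOrCreate())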

  • Yes, I have created a Spark session with enableHiveSupport(). Here is my code: `conf = pyspark.SparkConf(); conf.setMaster("spark://10.41.0.5:7077"); conf.setAppName("test"); sc = pyspark.SparkContext(conf=conf); spark = SparkSession(sc).builder.enableHiveSupport().getOrCreate(); df = spark.read.orc("hdfs://....")` and I still get the same error: AnalysisException: 'The ORC data source must be used with Hive support enabled;' – Alok Gogate Dec 15 '18 at 08:32
  • Here hdfs is running on a separate cluster. – Alok Gogate Dec 15 '18 at 09:26
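A likely cause, based on the code quoted in the comment above: `pyspark.SparkContext(conf=conf)` starts the context first, and `SparkSession(sc)` then registers a default session without Hive support, so the later `builder.enableHiveSupport().getOrCreate()` simply returns that existing session; `spark.sql.catalogImplementation` is a static setting and cannot be switched on a running context, so the Hive option is effectively ignored. Creating the session in a single builder call, as sketched under the answer above, avoids this; alternatively, on Spark 2.3+ the native ORC reader (see the sketch under the question's comments) sidesteps the Hive requirement entirely.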