I am trying to access HDFS files from a Spark cluster running inside Kubernetes containers.

However, I keep getting the error: AnalysisException: 'The ORC data source must be used with Hive support enabled;'

What am I missing here?

  • Before Spark 2.3 (I think) there was no built-in library for ORC; Spark used the Hive libraries -- i.e. a _full_ install of the Hive client libraries alongside the Spark install, and the appropriate classpath to reach those libs. Upgrade to v2.3 or 2.4 if you can... – Samson Scharfrichter Dec 15 '18 at 16:53
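On Spark 2.3+, ORC can be read without Hive support at all via the built-in reader, selected with the `spark.sql.orc.impl` setting (it defaults to `native` from 2.4 onward). A minimal sketch after upgrading, reusing the master URL quoted in this thread; the HDFS path is hypothetical, since the original one is elided:

    from pyspark.sql import SparkSession

    # Build the session in a single step; "native" selects Spark's
    # built-in ORC reader (2.3+), so no Hive support is required.
    spark = (SparkSession.builder
             .master("spark://10.41.0.5:7077")
             .appName("orc-native-test")
             .config("spark.sql.orc.impl", "native")
             .getOrCreate())

    # hypothetical path; the original is elided as hdfs://....
    df = spark.read.orc("hdfs://namenode:8020/path/to/data.orc")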

1 Answer

Have you created your SparkSession with enableHiveSupport()?

Similar issue: Spark can access Hive table from pyspark but not from spark-submit
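For reference, a minimal sketch of enabling Hive support through the builder (the master URL is the one from this thread; this assumes your Spark build includes Hive, i.e. the spark-hive jars are on the classpath). It must run before any SparkContext or SparkSession exists, because `spark.sql.catalogImplementation` is fixed once the context starts:

    from pyspark.sql import SparkSession

    # enableHiveSupport() sets spark.sql.catalogImplementation=hive;
    # it only takes effect when the session is created here, not when
    # attached to an already-running context.
    spark = (SparkSession.builder
             .master("spark://10.41.0.5:7077")
             .appName("test")
             .enableHiveSupport()
             .getOrCreate())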

  • Yes, I have created a Spark session with enableHiveSupport(). Here is my code: `conf = pyspark.SparkConf(); conf.setMaster("spark://10.41.0.5:7077"); conf.setAppName("test"); sc = pyspark.SparkContext(conf=conf); spark = SparkSession(sc).builder.enableHiveSupport().getOrCreate(); df = spark.read.orc("hdfs://....")` and I still get the same error: AnalysisException: 'The ORC data source must be used with Hive support enabled;' – Alok Gogate Dec 15 '18 at 08:32
  • Here hdfs is running on a separate cluster. – Alok Gogate Dec 15 '18 at 09:26
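A likely cause, based on the code quoted in the comment above: `pyspark.SparkContext(conf=conf)` starts the context first, and `SparkSession(sc)` then registers a default session without Hive support, so the later `builder.enableHiveSupport().getOrCreate()` simply returns that existing session; `spark.sql.catalogImplementation` is a static setting and cannot be switched on a running context, so the Hive option is effectively ignored. Creating the session in a single builder call, as sketched under the answer above, avoids this; alternatively, on Spark 2.3+ the native ORC reader (see the sketch under the question's comments) sidesteps the Hive requirement entirely.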