Using Spark 3.1, I need to provide the Hive configuration via the spark-submit
command (not inside the code).
Inside the code (which is not the solution I need), I can do the following, which works fine (I can list databases and select from tables; removing enableHiveSupport also works as long as the config is specified):
from pyspark.sql import SparkSession

# Hive metastore / warehouse configuration set directly on the builder
# (hdfs_host is defined earlier in the script).
spark = SparkSession.builder.appName("redacted-sprak31") \
    .enableHiveSupport() \
    .config("spark.sql.warehouse.dir", "hdfs://" + hdfs_host + ":8020/user/hive/warehouse") \
    .config("hive.metastore.uris", "thrift://" + hdfs_host + ":9083") \
    .config("spark.sql.hive.metastore.jars.path", "file:///spark_jars/var/hive_jars/*.jar") \
    .config("spark.sql.hive.metastore.version", "MYVERSION") \
    .config("spark.sql.hive.metastore.jars", "path") \
    .config("spark.sql.catalogImplementation", "hive") \
    .getOrCreate()
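For reference, the sanity check I run against that session is roughly the following (the database and table names are placeholders):

# Rough sketch of the sanity check; "db" and "table" stand in for the real names.
print([d.name for d in spark.catalog.listDatabases()])    # Hive databases are visible
spark.sql("select * from db.table limit 10").show()       # and tables can be queried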
This is submitted like this:
spark-submit \
--py-files={file} local://__main__.py
I now try to do the same using the --conf flag in the spark-submit command, removing all the config statements from the __main__.py file:
spark-submit \
--conf spark.sql.warehouse.dir="hdfs://${hdfs_host}:8020/user/hive/warehouse" \
--conf hive.metastore.uris="thrift://${hdfs_host}:9083" \
--conf spark.sql.hive.metastore.jars.path="file:///spark_jars/var/hive_jars/*.jar" \
--conf spark.sql.hive.metastore.version="MYVERSION" \
--conf spark.sql.hive.metastore.jars="path" \
--conf spark.sql.catalogImplementation="hive" \
--py-files={file} local://__main__.py
with, in __main__.py:
spark = SparkSession.builder.appName("redacted-sprak31") \
    .getOrCreate()
This gives me the following error when executing the very same SQL statement (a simple select * from DB.TABLE limit 10):
Traceback (most recent call last):
File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/sandbox/__main__.py", line 12, in <module>
df = spark.sql("select * from db.table limit 10")
File "/usr/local/lib/python3.7/site-packages/pyspark/sql/session.py", line 723, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/lib/python3.7/site-packages/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: Table or view not found: db.table; line 1 pos 14;
'GlobalLimit 10
+- 'LocalLimit 10
+- 'Project [*]
+- 'UnresolvedRelation [db, table], [], false
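In case it helps narrow this down, this is the kind of diagnostic I would add to __main__.py to see which of the --conf values actually reach the driver (just a sketch, not part of the real job):

# Quick diagnostic: print the Hive/warehouse-related settings the driver actually sees.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    if "hive" in key or "warehouse" in key or "catalog" in key:
        print(key, "=", value)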
- Why do the parameters passed via --conf not trigger the same behavior as the in-code configuration?
- What, consequently, am I missing for Spark to behave as expected (i.e., connect correctly to the metastore)?