Using Spark 3.1, I need to provide the Hive configuration via the spark-submit command (not inside the code).


Inside the code (which is not the solution I need), I can do the following, which works fine (I am able to list databases and select from tables; removing "enableHiveSupport" also works fine, as long as the config is specified):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redacted-sprak31") \
    .enableHiveSupport() \
    .config("spark.sql.warehouse.dir",
            "hdfs://" + hdfs_host + ":8020/user/hive/warehouse") \
    .config("hive.metastore.uris", "thrift://" + hdfs_host + ":9083") \
    .config("spark.sql.hive.metastore.jars.path", "file:///spark_jars/var/hive_jars/*.jar") \
    .config("spark.sql.hive.metastore.version", "MYVERSION") \
    .config("spark.sql.hive.metastore.jars", "path") \
    .config("spark.sql.catalogImplementation", "hive") \
    .getOrCreate()

I submit it like this:

spark-submit \
--py-files={file} local://__main__.py

However, when I pass the same settings with the --conf flag on the spark-submit command line and remove all the config statements from the __main__.py file:

spark-submit \
--conf spark.sql.warehouse.dir="hdfs://${hdfs_host}:8020/user/hive/warehouse" \
--conf hive.metastore.uris="thrift://${hdfs_host}:9083" \
--conf spark.sql.hive.metastore.jars.path="file:///spark_jars/var/hive_jars/*.jar" \
--conf spark.sql.hive.metastore.version="MYVERSION" \
--conf spark.sql.hive.metastore.jars="path" \
--conf spark.sql.catalogImplementation="hive" \
--py-files={file} local://__main__.py

with, in __main__.py:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redacted-sprak31") \
    .getOrCreate()

This gives me the following error when executing the very same SQL statement (a simple select * from DB.TABLE limit 10):

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/sandbox/__main__.py", line 12, in <module>
    df = spark.sql("select * from db.table limit 10")
  File "/usr/local/lib/python3.7/site-packages/pyspark/sql/session.py", line 723, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python3.7/site-packages/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Table or view not found: db.table; line 1 pos 14;
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'UnresolvedRelation [db, table], [], false
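
To help narrow this down, a quick sanity check could be to dump what the session actually received right after getOrCreate() (sketch below; spark.conf.get and sparkContext.getConf().getAll() are standard PySpark accessors, and the keys listed are simply the ones from my spark-submit command):

# Sketch: print the settings the driver actually received, right after getOrCreate()
for key in ("spark.sql.catalogImplementation",
            "spark.sql.warehouse.dir",
            "spark.sql.hive.metastore.version",
            "spark.sql.hive.metastore.jars",
            "hive.metastore.uris"):
    print(key, "=", spark.conf.get(key, "<not set>"))

# Or list everything that ended up in the SparkConf
for k, v in sorted(spark.sparkContext.getConf().getAll()):
    print(k, "=", v)
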
  • Why do the parameters passed via --conf not trigger the same behavior as the in-code configuration?
  • What, consequently, am I missing for Spark to behave as expected (i.e. connect correctly to the metastore)?
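
For what it's worth, I would also accept a solution that keeps the settings out of the code by other means, e.g. a properties file passed with --properties-file. A sketch of what I mean (not tested on my side; the file name is arbitrary, and the host has to be written out literally since there is no shell expansion inside the file):

# hive.properties (arbitrary name), whitespace-separated key/value pairs
spark.sql.warehouse.dir             hdfs://<hdfs_host>:8020/user/hive/warehouse
hive.metastore.uris                 thrift://<hdfs_host>:9083
spark.sql.hive.metastore.jars.path  file:///spark_jars/var/hive_jars/*.jar
spark.sql.hive.metastore.version    MYVERSION
spark.sql.hive.metastore.jars       path
spark.sql.catalogImplementation     hive

spark-submit \
--properties-file hive.properties \
--py-files={file} local://__main__.py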