Short Answer
There are quirks in argument ordering: --packages isn't accepted by spark-submit if it comes after the my_job.py argument. To work around this, you can do the following when submitting from Dataproc's CLI:
gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
--properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py
Basically, just add --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 before the .py file in your command.
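For context, a job that needs the spark-csv package might look something like the following sketch (the GCS path and the reader options here are illustrative assumptions, not taken from the original question):

# my_job.py -- minimal sketch of a job that depends on spark-csv.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="csv-example")
sqlContext = SQLContext(sc)

# spark-csv registers the "com.databricks.spark.csv" data source format;
# without the package on the classpath, this load() fails.
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .options(header="true", inferSchema="true")
      .load("gs://my-bucket/input/*.csv"))

df.printSchema()
print(df.count())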
Long Answer
So, this is actually a different issue from the known lack of support for --jars in gcloud beta dataproc jobs submit pyspark; it appears that, because Dataproc doesn't explicitly recognize --packages as a special spark-submit-level flag, it passes the flag after the application arguments, so spark-submit lets --packages fall through as an application argument rather than parsing it as a submission-level option. Indeed, in an SSH session, the following does not work:
# Doesn't work if job.py depends on that package.
spark-submit job.py --packages com.databricks:spark-csv_2.11:1.2.0
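One way to see the fall-through is a throwaway script that just prints what the application receives (args_check.py is a name assumed here for illustration):

# args_check.py -- throwaway script (name assumed for illustration) that
# prints whatever spark-submit hands to the application.
import sys
print(sys.argv)

# Running: spark-submit args_check.py --packages com.databricks:spark-csv_2.11:1.2.0
# prints something like:
#   ['/path/to/args_check.py', '--packages', 'com.databricks:spark-csv_2.11:1.2.0']
# i.e. --packages never reaches spark-submit's own parser.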
But switching the order of the arguments does work, even though in the pyspark case both orderings work:
# Works with dependencies on that package.
spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py
pyspark job.py --packages com.databricks:spark-csv_2.11:1.2.0
pyspark --packages com.databricks:spark-csv_2.11:1.2.0 job.py
So even though spark-submit job.py is supposed to be a drop-in replacement for everything that previously called pyspark job.py, the difference in parse ordering for things like --packages means it's not actually a 100% compatible migration. This might be something to follow up on with the Spark folks.
Anyhow, fortunately there's a workaround, since --packages is just another alias for the Spark property spark.jars.packages, and Dataproc's CLI supports properties just fine. So you can just do the following:
gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
--properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py
Note that the --properties flag must come before my_job.py, otherwise it gets sent as an application argument rather than as a configuration flag. The equivalent in an SSH session would be spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py. Hope that works for you!
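If you want to double-check from inside the job that the property actually took effect, one option (just a sketch; the fallback string is an arbitrary choice) is to read it back from the SparkContext's configuration:

# Inside the job: read back the property that --properties set.
from pyspark import SparkContext

sc = SparkContext(appName="check-packages")
# Prints the configured packages, or the fallback string if the property
# never made it through.
print(sc.getConf().get("spark.jars.packages", "not set"))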