I have a PySpark application, which I submit using spark-submit like this:
spark-submit --deploy-mode cluster --master yarn --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 app.py
This works, but only because the cluster has internet access and can download the dependency specified by --packages.
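For context, app.py reads from Kafka roughly like this (the broker address and topic name below are placeholders); the kafka source format is exactly what spark-sql-kafka-0-10 provides:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("app").getOrCreate()

# The "kafka" data source is provided by spark-sql-kafka-0-10; without
# --packages (or a bundled jar) this fails with
# "Failed to find data source: kafka".
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
      .option("subscribe", "my-topic")                   # placeholder
      .load())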
The goal
Now I would like to bundle my PySpark application together with its dependencies, so that they do not have to be downloaded at submit time.
I found some tutorials on how to bundle Python dependencies, but that is not what I need: spark-sql-kafka is a JVM dependency, not a Python one.
What I tried:
I used the maven-shade-plugin to create a Maven package in which I declared org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 as a dependency (sketch of the pom.xml below), and built it with mvn package.
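The pom.xml was roughly like this (the groupId, artifactId, and version of my own wrapper package are placeholders):

<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <!-- placeholder coordinates for my wrapper package -->
  <groupId>com.example</groupId>
  <artifactId>app-deps</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>

  <!-- the dependency I want bundled instead of downloaded -->
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
      <version>3.0.0</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <!-- shade all dependencies into one fat jar during mvn package -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.2.4</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>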
I submitted the resulting jar using:
spark-submit --deploy-mode cluster --master yarn app.jar --py-files app.py
But I got:
Exception in thread "main" org.apache.spark.SparkException: No main class set in JAR; please specify one with --class.
But I do not have a main class, because my main is a Python script, right?!
So how do I do this?