
I have a pyspark application, which I submit using spark-submit like this:

spark-submit --deploy-mode cluster --master yarn --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 app.py

This works, but only because the cluster has internet access and can download the dependency specified by "--packages".

The goal

Now I would like to bundle my pyspark application together with its dependencies, so that they do not have to be downloaded at submit time.

I found some tutorials about how to bundle Python dependencies, but that is not what I need here.

What I tried:

I used the maven-shade-plugin to create a Maven package in which I specified org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0 as a dependency, and bundled it using

mvn package
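
A minimal sketch of such a pom (the shade-plugin version and the project coordinates are placeholders):

<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>

  <!-- placeholder coordinates for the bundling project -->
  <groupId>com.example</groupId>
  <artifactId>app-deps</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>

  <dependencies>
    <!-- the connector previously downloaded at submit time via the packages option -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
      <version>3.0.0</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <!-- bundle the dependency and its transitive dependencies into one uber-jar -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.2.4</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>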

I submitted the resulting jar using:

spark-submit --deploy-mode cluster --master yarn app.jar --py-files app.py

But I got:

Exception in thread "main" org.apache.spark.SparkException: No main class set in JAR; please specify one with --class.

I do not have a main class, because my entry point is a Python script, isn't it?

So how do I do this?

Nathan
  • You are on the right path building an uber-jar with dependencies, except your last spark-submit is trying to deploy a jar job with python files as "extras" to the executor. Try `spark-submit --jars app.jar app.py`. Good luck! – Sai Sep 27 '20 at 19:38
  • You may also want to look at https://stackoverflow.com/questions/37132559/add-jars-to-a-spark-job-spark-submit; there might be other conf that is needed to ensure the jar ends up in the classpath of the driver and executor. Can't recall those off the top of my head. – Sai Sep 27 '20 at 19:40
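
Putting the two comments together, the submission would presumably look something like this (keeping the original cluster/YARN flags; the linked question also covers settings such as spark.driver.extraClassPath and spark.executor.extraClassPath in case the jar still does not end up on the classpath):

spark-submit --deploy-mode cluster --master yarn --jars app.jar app.py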
