
I have a PySpark project with a Python script that runs a Spark Streaming job. It has some external dependencies, which I currently pull in with the --packages flag.
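For context, this is roughly what I run today (the script name and Maven coordinate below are just examples, not my actual job):

```bash
# Current workflow: dependencies are resolved at submit time via --packages
spark-submit \
  --master yarn \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 \
  my_streaming_job.py
```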

In Scala, however, we can use Maven to download all the required packages, build a single jar that contains the main Spark program together with its dependencies, and then just use spark-submit to send that one jar to the cluster (YARN in my case).
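That Scala workflow looks something like this (the class and jar names are illustrative; the fat jar would come from something like `mvn package` with the shade plugin or `sbt assembly`):

```bash
# Scala: one assembled ("fat") jar is the only artifact you submit
spark-submit \
  --master yarn \
  --class com.example.MyStreamingApp \
  my-streaming-app-assembly-1.0.jar
```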

Is there anything similar to a jar for PySpark?

I can't find this in the official Spark documentation. It only mentions spark-submit <python-file> and the --py-files option, which doesn't feel as clean as a single jar file.

Any suggestions would be helpful. Thanks!

HackCode
  • With Python you send the main Python file and any other files with these options: --py-files (files to add to the Python search path), --archives (compressed files to be extracted into the working directory of the executors), or --files (files to be placed in the working directory of the executors). So you can create a zip with everything you need and use the --archives option. – Carlos AG Jun 10 '23 at 06:12
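A minimal sketch of that --archives approach, assuming the environment is packed with venv-pack (the file names, alias, and config keys below are illustrative and may need adjusting for your cluster deploy mode):

```bash
# Pack the local virtualenv first, e.g.:
#   pip install venv-pack && venv-pack -o pyspark_env.tar.gz
spark-submit \
  --master yarn \
  --archives pyspark_env.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
  my_streaming_job.py
```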

1 Answer


The documentation says you can use a .zip or .egg file:

For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

Source

You might also find the other spark-submit options useful.
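To get close to the "single artifact" feel of a jar, one common pattern is to bundle your own modules and any pure-Python dependencies into one zip and ship it with --py-files. A sketch, with illustrative file and package names:

```bash
# Bundle pure-Python dependencies and your own package into one zip
pip install -r requirements.txt -t deps/
cp -r src/mypackage deps/
(cd deps && zip -r ../deps.zip .)

# Submit the entry-point script plus the single zip of dependencies
spark-submit \
  --master yarn \
  --py-files deps.zip \
  main.py
```

Note that packages shipped this way are added to the Python path from the zip, so this works for pure-Python code; dependencies with native extensions (numpy, pandas, etc.) are usually better handled with a packed environment via --archives, as mentioned in the comments.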

OneCricketeer