I am not 100% sure I understand how you normally run the script, but let's assume you have a script called script.py which should receive two arguments, arg1 and arg2, and that when you run it from the command line with spark-submit you pass two options, opt1 and opt2, like this:
spark-submit --opt1 opt1 --opt2 opt2 script.py arg1 arg2
If I understand correctly, in your case this is:
spark-submit --jars spark-assembly-1.5.2.2.3.4.7-4-hadoop2.7.1.2.3.4.7-4.jar,spark-streaming-kafka-assembly_2.10-1.6.1.jar file.py arg1 arg2
Let's also assume that everything runs when you do so from the command line (if not, make sure that works first).
**Define environment variables**
The goal of this step is to enable running as follows:
python script.py arg1 arg2
To do so you need to define the proper environment variables:
PYTHONPATH
Should include Spark's Python sources and the py4j zip (a sketch of an in-script alternative follows the notes below):
$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-XXX-src.zip
- $SPARK_HOME is where you installed Spark (e.g. /opt/spark). On Windows you might have defined it as %SPARK_HOME% (or you can just put the path in directly).
- The XXX in the py4j path depends on your Spark version.
- For example, for Spark 2.0.1 this would be py4j-0.10.3-src.zip.
- For Spark 1.6.1 I think it was py4j-0.9-src.zip, but you should check.
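If you prefer not to manage PYTHONPATH by hand, the same thing can be done from inside the script. This is just a minimal sketch, assuming SPARK_HOME is set in the environment (the /opt/spark fallback and the glob pattern are only illustrative):

import glob
import os
import sys

# Make Spark's Python sources and the py4j zip importable without PYTHONPATH.
spark_home = os.environ.get("SPARK_HOME", "/opt/spark")  # fallback is just an example
sys.path.insert(0, os.path.join(spark_home, "python"))
# The py4j zip name depends on the Spark version, so match it with a glob.
for py4j_zip in glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")):
    sys.path.insert(0, py4j_zip)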
PYSPARK_SUBMIT_ARGS
This tells Spark how to load everything. It should include all the arguments you would normally pass to spark-submit, followed by "pyspark-shell" at the end.
In your case this would probably have the following value:
--jars spark-assembly-1.5.2.2.3.4.7-4-hadoop2.7.1.2.3.4.7-4.jar,spark-streaming-kafka-assembly_2.10-1.6.1.jar pyspark-shell
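If you would rather not define PYSPARK_SUBMIT_ARGS in the run configuration, here is a sketch of setting it from inside the script before pyspark is imported (the jar names are the ones from your command, and the app name is made up):

import os

# Must be set before pyspark is imported for the jars to be picked up.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars spark-assembly-1.5.2.2.3.4.7-4-hadoop2.7.1.2.3.4.7-4.jar,"
    "spark-streaming-kafka-assembly_2.10-1.6.1.jar pyspark-shell"
)

from pyspark import SparkContext
sc = SparkContext(appName="script")  # appName is just an example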
**Configure the run configuration**
Now you can configure this the same as any other Python script. Just make sure to put the arguments (arg1 arg2) in the script parameters field.
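To check that the parameters come through, a tiny sketch of how script.py would read them (plain sys.argv, assuming exactly two arguments):

import sys

# With "arg1 arg2" in the script parameters field, sys.argv is
# ["script.py", "arg1", "arg2"], exactly as with spark-submit.
if __name__ == "__main__":
    arg1, arg2 = sys.argv[1], sys.argv[2]
    print(arg1, arg2)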