
I have a metrics.py script which calculates a graph.

I can call it from the command line (python ./metrics.py -i [input] [output]).

I want to write a function in Spark that calls the metrics.py script on a provided file path and collects the values that metrics.py prints out.

How can I do that?


1 Answer


In order to run metrics.py, you essentially need to ship it to all the executor nodes that run your Spark job.

To do this, you can either pass it when creating the SparkContext -

sc = SparkContext(conf=conf, pyFiles=['path_to_metrics.py'])

or add it later using the SparkContext's addPyFile method -

sc.addPyFile('path_to_metrics.py')

In either case, do not forget to import the metrics module afterwards and call the function that produces the output you need.

import metrics
metrics.relevant_function()
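
If the goal is to run metrics.py against a list of input paths and gather the results back on the driver, one way is to parallelize the paths and call the function on the executors. Below is a minimal sketch, assuming metrics.py exposes a function (hypothetically named compute_metrics here) that takes a file path and returns the values the script would normally print:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('run-metrics')
sc = SparkContext(conf=conf, pyFiles=['path_to_metrics.py'])

# Placeholder input paths; replace with your own files.
input_paths = ['/data/file1.txt', '/data/file2.txt']

def run_metrics(path):
    # Import inside the function so it resolves on the executors,
    # where metrics.py was shipped via pyFiles.
    import metrics
    # compute_metrics is a stand-in for whatever function in metrics.py
    # produces the values the script prints on the command line.
    return path, metrics.compute_metrics(path)

# Run on the executors and bring the (path, values) pairs back to the driver.
results = sc.parallelize(input_paths).map(run_metrics).collect()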

Also make sure that all the Python libraries imported inside metrics.py are installed on every executor node. Otherwise, ship them along with your job using the --py-files and --jars options when you spark-submit it.
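
For example, a spark-submit invocation that ships metrics.py alongside your own entry-point script might look roughly like this (driver.py and python_deps.zip are placeholders for your job script and any extra zipped Python dependencies; the master URL depends on your cluster):

spark-submit --master yarn \
  --py-files metrics.py,python_deps.zip \
  driver.py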
