
I am new to Spark. I developed a PySpark script in the Jupyter Notebook UI installed on our HDInsight cluster. So far I have only run the code from Jupyter itself, but now I need to automate the script. I tried Azure Data Factory but could not find a way to run the PySpark script from there. I also tried Oozie but could not figure out how to use it. Saving the notebook, reopening it, and running all cells works, but that is still a manual process.

Please help me schedule a PySpark job in Microsoft Azure.

Arron

1 Answer


I found a discussion about the best practice for running scheduled, crontab-like jobs with Apache Spark / PySpark, which you may already have reviewed.

If you do not want to use Oozie, a simple approach is to save the Jupyter notebook as a Python script locally and write a shell script that submits it to HDInsight Spark via Livy, with Linux crontab as the scheduler (a sketch of the submission step follows the reference list below). For reference, see the links below.

  1. IPython Notebook save location
  2. How can I configure pyspark on livy to use anaconda python instead of the default one
  3. Submit Spark jobs remotely to an Apache Spark cluster on HDInsight using Livy
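
For illustration, here is a minimal, untested Python sketch of that Livy submission step, written with the requests library instead of a shell script with curl. The cluster name, login credentials, and storage path are placeholders rather than values from your environment, and the PySpark script is assumed to already be uploaded to the storage attached to the cluster.

```python
# Minimal sketch: submit a PySpark script to an HDInsight Spark cluster
# through the Livy batch REST API. All names and credentials below are
# placeholders; replace them with your own values.
#
# Could be scheduled from a Linux box with crontab, e.g.:
#   0 2 * * * /usr/bin/python3 /path/to/submit_livy_batch.py
import json
import time

import requests

CLUSTER = "yourcluster"                                 # placeholder cluster name
LIVY_URL = f"https://{CLUSTER}.azurehdinsight.net/livy/batches"
AUTH = ("admin", "your-cluster-login-password")         # cluster HTTP login credentials
HEADERS = {"Content-Type": "application/json", "X-Requested-By": "admin"}

# The PySpark script must already exist in storage attached to the cluster
# (e.g. the default WASB/ADLS container); this path is a placeholder.
payload = {
    "file": "wasbs://container@youraccount.blob.core.windows.net/scripts/my_job.py",
    "args": [],                                         # optional arguments for the script
    "name": "scheduled-pyspark-job",
}

# Submit the batch job.
resp = requests.post(LIVY_URL, data=json.dumps(payload), headers=HEADERS, auth=AUTH)
resp.raise_for_status()
batch_id = resp.json()["id"]
print(f"Submitted Livy batch {batch_id}")

# Poll until the job reaches a terminal state.
while True:
    state = requests.get(f"{LIVY_URL}/{batch_id}", headers=HEADERS, auth=AUTH).json()["state"]
    print(f"Batch {batch_id} state: {state}")
    if state in ("success", "dead", "killed", "error"):
        break
    time.sleep(30)
```

Once saved on a Linux machine, this script can itself be scheduled with crontab; the same POST to the Livy batches endpoint could equally be done with curl from a shell script if you prefer.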

Hope it helps.

Peter Pan