
I have an Airflow pipeline, and one of the DAGs contains a Spark job. I have two options for the Spark job (the job writes to ElasticSearch, but I don't know if this is useful information):

  1. write the job in Scala to increase performance
  2. use PySpark, as the Airflow pipeline is defined in Python

Is there a better option for readability/performance/error handling? (I have no preference between Scala and Python.)
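
For reference, here is roughly how I'd wire either option into the DAG. This is a minimal sketch: the paths, class name, connection id, and schedule are placeholders on my side, and the `SparkSubmitOperator` import path varies by Airflow version (shown here for the 2.x provider package):

```python
# Minimal sketch: both options are submitted from Airflow the same way,
# only the application artifact (jar vs. .py) differs.
# Paths, class names, conn_id, and schedule below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
# On Airflow 1.10.x the import would instead be:
# from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

with DAG(
    dag_id="spark_to_elasticsearch",
    start_date=datetime(2020, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Option 1: Scala job packaged as a jar
    scala_job = SparkSubmitOperator(
        task_id="scala_spark_job",
        application="/opt/jobs/es-writer-assembly.jar",  # placeholder jar path
        java_class="com.example.EsWriter",               # placeholder main class
        conn_id="spark_default",
    )

    # Option 2: PySpark script
    pyspark_job = SparkSubmitOperator(
        task_id="pyspark_job",
        application="/opt/jobs/es_writer.py",            # placeholder script path
        conn_id="spark_default",
    )
```

So from the Airflow side the integration looks identical either way; the question is really about the job code itself.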

Thank you in advance

  • I think a big part of this is preference. I'd prefer Scala as I like Scala and the JVM. Also, the exceptions would be JVM-related ones, so easier for me to grasp. So I think it depends on your own personal preference. You can run the Spark job super easily as the Scala version, and you can also do it with Python. So choose whichever you prefer :) And in regards to performance: I've heard Scala is a bit better, but I don't think it matters unless you're dealing with a huge amount of data. – GamingFelix Oct 06 '20 at 14:47
  • To do it with Scala, see https://stackoverflow.com/questions/39827804/how-to-run-spark-code-in-airflow – GamingFelix Oct 06 '20 at 14:48
