
I'm trying to figure out the best way to work with Airflow and Spark/Hadoop. I already have a Spark/Hadoop cluster, and I'm thinking about creating another cluster for Airflow that will submit jobs remotely to the Spark/Hadoop cluster.

Any advice about it? It looks like it's a little complicated to deploy Spark remotely from another cluster, and that would create some configuration-file duplication.

Henrique Goulart

3 Answers


You really only need to configure a yarn-site.xml file, I believe, in order for spark-submit --master yarn --deploy-mode client to work. (You could try cluster deploy mode, but I think having the driver managed by Airflow isn't a bad idea.)

Once an Application Master is deployed within YARN, Spark is effectively running local to the Hadoop cluster.

If you really want, you could add an hdfs-site.xml and hive-site.xml to be submitted from Airflow as well (if that's possible), but otherwise at least the hdfs-site.xml file should be picked up from the YARN container classpath (not all NodeManagers may have a Hive client installed on them).
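
For illustration, a minimal sketch of what the Airflow side of this could look like, assuming the Spark client and the apache-spark provider are installed on the Airflow worker, a spark_yarn connection points at YARN in client deploy mode, and HADOOP_CONF_DIR points at the directory holding the copied yarn-site.xml. The application path and connection name are hypothetical, and the import path is for Airflow 2.x (older releases shipped this operator under airflow.contrib):

    # Sketch: run spark-submit in yarn/client mode from an Airflow worker.
    # Assumes the copied yarn-site.xml (and optionally hdfs-site.xml /
    # hive-site.xml) lives under /etc/hadoop/conf on that worker.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    with DAG(
        dag_id="spark_client_mode_example",
        start_date=datetime(2018, 8, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        submit_job = SparkSubmitOperator(
            task_id="submit_spark_job",
            conn_id="spark_yarn",                # connection configured for master=yarn, client mode
            application="/opt/jobs/my_job.py",   # hypothetical application path
            name="airflow_submitted_job",
            env_vars={"HADOOP_CONF_DIR": "/etc/hadoop/conf"},  # where yarn-site.xml was copied
            executor_memory="2g",
            num_executors=2,
        )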

OneCricketeer
  • You can do that remotely, but you need to configure access to your Hive metastore, Hadoop cluster and resource manager. Please also remember that Spark is very sensitive to data locality, so if your Spark client sits where Airflow does, on a different network, your jobs might be slower. I had an issue like that, which is why I keep Airflow on the same network as the Hadoop cluster. – Tomasz Krol Aug 25 '18 at 12:12
  • Right, HDFS prefers local reads, and configuring Spark would be no different than any other Hadoop+Hive client – OneCricketeer Aug 25 '18 at 16:18
  • Ok, thank you for sharing your experience. I'm just looking for the best way to keep these frameworks working together like a couple. I'll keep them in two different clusters. – Henrique Goulart Aug 26 '18 at 17:25
  • If you're using Ambari, you can install Spark and Hive clients onto the Airflow workers and keep the configs in sync: https://medium.com/@mykolamykhalov/integrating-apache-airflow-with-apache-ambari-ccab2c90173 – OneCricketeer Aug 26 '18 at 17:30

I prefer submitting Spark jobs using the SSHOperator and running the spark-submit command on the cluster, which saves you from copying yarn-site.xml around. Also, I would not create a cluster for Airflow if the only task you perform is running Spark jobs; a single VM with the LocalExecutor should be fine.
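
For example, a minimal sketch of that setup, assuming a hadoop_edge_node SSH connection pointing at an edge node that already has the Spark client and Hadoop configs, and a hypothetical job path (import path is Airflow 2.x; older versions had SSHOperator under airflow.contrib):

    # Sketch: SSH into an edge node of the Hadoop cluster and run spark-submit
    # there, so no Hadoop config files need to live on the Airflow machine.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.ssh.operators.ssh import SSHOperator

    with DAG(
        dag_id="spark_over_ssh_example",
        start_date=datetime(2018, 8, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        spark_submit = SSHOperator(
            task_id="spark_submit_via_ssh",
            ssh_conn_id="hadoop_edge_node",   # SSH connection to a cluster edge node
            command=(
                "spark-submit --master yarn --deploy-mode client "
                "/opt/jobs/my_job.py"
            ),
        )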

kaxil

There are a variety of options for remotely performing spark-submit via Airflow.

Do note that none of these are plug-and-play ready and you'll have to write your own operators to get things done.
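
As a rough illustration of what "write your own operator" could mean, here is a hypothetical sketch of a custom operator that simply shells out to spark-submit; the class name, parameters, and behaviour are illustrative only, not something this answer prescribes (on Airflow 1.x you would also need the apply_defaults decorator on __init__):

    # Hypothetical custom operator wrapping a plain spark-submit call.
    import subprocess

    from airflow.models import BaseOperator


    class RemoteSparkSubmitOperator(BaseOperator):
        """Illustrative operator that runs spark-submit as a subprocess."""

        def __init__(self, application, master="yarn", deploy_mode="client", **kwargs):
            super().__init__(**kwargs)
            self.application = application
            self.master = master
            self.deploy_mode = deploy_mode

        def execute(self, context):
            cmd = [
                "spark-submit",
                "--master", self.master,
                "--deploy-mode", self.deploy_mode,
                self.application,
            ]
            self.log.info("Running: %s", " ".join(cmd))
            # check=True raises CalledProcessError (failing the task) on a non-zero exit code
            subprocess.run(cmd, check=True)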

y2k-shubham