6

How can I establish a connection between an EMR master cluster (created by Terraform) and Airflow? I have Airflow set up on an AWS EC2 server in the same security group, VPC and subnet.

I need a solution that lets Airflow talk to EMR and run spark-submit.

https://aws.amazon.com/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/

Blog posts like this one explain how to run jobs once the connection has been established, so they didn't help much.

In Airflow I have created connections for AWS and EMR through the UI.

Below is the code that lists the EMR clusters, both active and terminated; I can also fine-tune it to return only the active clusters:

from airflow.contrib.hooks.aws_hook import AwsHook

# Build a boto3 EMR client from the Airflow connection
hook = AwsHook(aws_conn_id='aws_default')
client = hook.get_client_type('emr', 'eu-central-1')
# list_clusters() returns the cluster summaries under the 'Clusters' key
for x in client.list_clusters()['Clusters']:
    print(x['Status']['State'], x['Name'])

My question is: how can I update the code above so that it can run spark-submit actions?

GabLeRoux
asur
  • Hi Kally, please specify the issue you are facing and what you have tried so far – varnit Jan 03 '19 at 13:06
  • Hi Kally, can you share which resources you have created and which connection is not working? – Pradeep Bhadani Jan 03 '19 at 13:39
  • @varnit I have updated the code, which now lists all EMR clusters. How can I find the master server IP of a single EMR cluster so that I can submit my Spark code to it? – asur Jan 03 '19 at 16:45
  • @pradeep I have updated the code, which now lists all EMR clusters. How can I find the master server IP of a single EMR cluster so that I can submit my Spark code to it? – asur Jan 03 '19 at 16:46

2 Answers

15

While this may not directly address your particular query, here are some broad ways to trigger spark-submit on a (remote) EMR cluster via Airflow:

  1. Use Apache Livy

    • This solution is actually independent of remote server, i.e., EMR
    • Here's an example
    • The downside is that Livy is in early stages and its API appears incomplete and wonky to me
  2. Use EmrSteps API

    • Dependent on remote system: EMR
    • Robust, but since it is inherently async, you will also need an EmrStepSensor (alongside EmrAddStepsOperator)
    • On a single EMR cluster, you cannot have more than one step running simultaneously (although some hacky workarounds exist)
  3. Use SSHHook / SSHOperator

    • Again independent of remote system
    • Comparatively easier to get started with
    • If your spark-submit command involves a lot of arguments, building that command (programmatically) can become cumbersome
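
As a rough sketch of option 2 (not taken from the linked material), the step definition and the operator/sensor pair could look like the following. It assumes Airflow 1.10's contrib modules, an `aws_default` connection and an already known cluster id; `build_spark_step`, `add_emr_tasks`, the task ids and the S3 path are hypothetical names used only for illustration.

```python
def build_spark_step(name, app_path, app_args=(), deploy_mode="cluster"):
    """Return one EMR step definition that runs spark-submit via command-runner.jar."""
    args = ["spark-submit", "--deploy-mode", deploy_mode, app_path] + list(app_args)
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster alive if the step fails
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


def add_emr_tasks(dag, cluster_id, steps):
    """Wire an add-steps task and a sensor into an existing DAG (hypothetical helper)."""
    # Imported lazily so build_spark_step() can be used without Airflow installed.
    from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
    from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor

    add_steps = EmrAddStepsOperator(
        task_id="add_steps",
        job_flow_id=cluster_id,  # the cluster id, e.g. "j-XXXXXXXX"
        aws_conn_id="aws_default",
        steps=steps,
        dag=dag,
    )
    # The operator pushes the new step ids to XCom; the sensor polls the first one.
    watch_step = EmrStepSensor(
        task_id="watch_step",
        job_flow_id=cluster_id,
        step_id="{{ task_instance.xcom_pull(task_ids='add_steps', key='return_value')[0] }}",
        aws_conn_id="aws_default",
        dag=dag,
    )
    add_steps >> watch_step
    return add_steps, watch_step
```

Calling `add_emr_tasks(dag, cluster_id, [build_spark_step("my job", "s3://my-bucket/job.py")])` inside a DAG file would then submit the step and wait for it to finish; the bucket path is a placeholder.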

EDIT-1

There seems to be another straightforward way:

  1. Specifying remote master-IP

    • Independent of remote system
    • Requires modifying global configuration / environment variables
    • See @cricket_007's answer for details
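
For the approaches that need the master's address, one possible way to look it up with boto3 is sketched below. It assumes the cluster can be identified by its name, that it uses instance groups (not instance fleets), and that the first page of `list_clusters` results is enough; `find_master_address` is a hypothetical helper, and the client is expected to be a boto3 EMR client such as the one returned by `AwsHook.get_client_type('emr', ...)`.

```python
def find_master_address(emr_client, cluster_name):
    """Return (cluster_id, master_private_ip) for the first active cluster with this name."""
    # Only consider clusters that can currently accept work.
    response = emr_client.list_clusters(ClusterStates=["WAITING", "RUNNING"])
    matches = [c for c in response["Clusters"] if c["Name"] == cluster_name]
    if not matches:
        raise LookupError("no active EMR cluster named %r" % cluster_name)
    cluster_id = matches[0]["Id"]

    # Ask only for the instances of the MASTER instance group.
    instances = emr_client.list_instances(
        ClusterId=cluster_id, InstanceGroupTypes=["MASTER"]
    )["Instances"]
    if not instances:
        raise LookupError("cluster %s reports no master instance" % cluster_id)
    return cluster_id, instances[0]["PrivateIpAddress"]
```

The returned cluster id is what the EMR steps route calls `JobFlowId`, while the private IP is what an SSH-based route would connect to from inside the same VPC.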


y2k-shubham
  • Thank you for the info. My EMR clusters are created by an AWS ASG, and I need a way to pick a single running EMR master from AWS (we currently run 4 clusters in a single environment). In other words, how can I specify which EMR cluster the spark-submit should go to? – asur Jan 08 '19 at 17:04
  • **@Kally** if you take the `EmrStep` route, the **cluster-id** a.k.a. `JobFlowId` will be needed to specify which cluster to submit to. Otherwise, you will have to obtain the **private-IP of that cluster's `master`** (which I think you can easily do [via `boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.list_instances)). While I'm a novice with `AWS` infrastructure, I believe `IAM Role`s would come in handy for authorization (I assume you already know that) – y2k-shubham Jan 08 '19 at 17:20
  • See [this](https://stackoverflow.com/a/53743370/3679900) for hints on how to modify `Airflow`'s built-in `operator`s to work over `SSH` – y2k-shubham Feb 05 '19 at 19:44

1

Since you created the EMR cluster with Terraform, you can get the master node's public DNS name from the attribute aws_emr_cluster.my-emr.master_public_dns.

Hope this helps.
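
One way that DNS name might then be used from Airflow is over SSH. The following is only a sketch under several assumptions: an Airflow SSH connection (here called `emr_ssh`, a hypothetical id) configured with the cluster's EC2 key pair and the `hadoop` user, Airflow 1.10's contrib `SSHOperator`, and a security group that allows port 22 from the Airflow host. `build_spark_submit` and `make_ssh_task` are illustrative names.

```python
import shlex


def build_spark_submit(app_path, app_args=(), conf=None,
                       deploy_mode="cluster", master="yarn"):
    """Build a shell-safe spark-submit command line from its parts."""
    parts = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    # Sort the settings so the generated command is deterministic.
    for key, value in sorted((conf or {}).items()):
        parts += ["--conf", "{}={}".format(key, value)]
    parts.append(app_path)
    parts.extend(app_args)
    # Quote every token so spaces or shell metacharacters survive the remote shell.
    return " ".join(shlex.quote(str(p)) for p in parts)


def make_ssh_task(dag, master_dns, command):
    """Create a task that runs the command on the EMR master (hypothetical helper)."""
    # Imported lazily so build_spark_submit() can be used without Airflow installed.
    from airflow.contrib.operators.ssh_operator import SSHOperator

    return SSHOperator(
        task_id="spark_submit_over_ssh",
        ssh_conn_id="emr_ssh",   # assumed connection holding the user and key file
        remote_host=master_dns,  # e.g. the Terraform master_public_dns output
        command=command,
        dag=dag,
    )
```

Building the command from a list and a dict keeps long argument lists manageable, and the quoting step avoids surprises when a path or argument contains spaces.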

Pradeep Bhadani
  • Thank you. How can I authenticate to this master server and run spark-submit? – asur Jan 04 '19 at 15:48