0

I have a Hadoop cluster set up in AWS EC2, but my development setup (Spark) is on my local Windows system. I can connect to the AWS Hive Thrift server, but I get a connection refused error when I try to submit a job from my local Spark configuration. Please note that my Windows username is different from the username under which the Hadoop ecosystem runs on the AWS server. Can anyone explain how the underlying system works in this setup?

1) When I submit a job from my local Spark to the Hive Thrift server and it involves an MR job, will the AWS Hive setup submit that job to the NameNode under its own identity, or will it carry forward my local Spark identity?

2) In my configuration, do I need to run Spark locally under the same username as the Hadoop cluster in AWS?

3) Do I also need to configure SSL to authenticate my local system?

Please note that my local system is not part of the Hadoop cluster and cannot be added to the AWS Hadoop cluster.

Please let me know what the actual setup should be for an environment where the Hadoop cluster is in AWS and Spark runs on my local machine.

Biswajit – 323 • 4 • 15
  • Just think of your local machine as any "edge node". You need all the Hadoop+Hive XML configuration files locally. For Hive you can explicitly set `hive.metastore.uris` in your Spark code (see the sketch after these comments). https://stackoverflow.com/questions/31980584/how-to-connect-to-a-hive-metastore-programmatically-in-sparksql#31993754 – OneCricketeer Aug 28 '17 at 07:10
  • But as per my understanding an edge node has to be part of the same cluster, isn't it? Do you mean in this case my local system also has to be included in the AWS Hadoop cluster? – Biswajit Aug 28 '17 at 07:59
  • Edge node is any computer on the perimeter of the network. Client-only configuration files. No running cluster services. You will need to open the necessary ec2 ports for all the services, though. (NameNode, thrift, Datanode, Spark History Server, ResourceManager, etc, etc) – OneCricketeer Aug 28 '17 at 08:09
  • You can refer to another question I just answered: https://stackoverflow.com/questions/45911587 – OneCricketeer Aug 28 '17 at 08:11
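
For reference, a minimal sketch of the approach from the comments, assuming a SparkSession-based Scala application; the metastore host below is a placeholder for your EC2 endpoint, and 9083 is the default metastore Thrift port:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder host: replace with the public DNS of the EC2 node running the
// Hive metastore; the port must also be open in the instance's security group.
val spark = SparkSession.builder()
  .appName("remote-hive-metastore")
  .config("hive.metastore.uris", "thrift://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:9083")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
```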

1 Answer

0

To simplify the problem, you are free to compile your code locally, produce an uber/shaded JAR, SCP it to any Spark client node in AWS, then run `spark-submit --master yarn --class <classname> <jar-file>`.

However, if you want to run Spark against EC2 from your local machine, you can set a few properties programmatically.

Spark submit YARN mode HADOOP_CONF_DIR contents
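
A rough sketch of what setting those properties programmatically can look like; the host names are placeholders, and the properties mirror what would normally live in core-site.xml and yarn-site.xml under HADOOP_CONF_DIR (8020 and 8032 are the common default ports for the NameNode and ResourceManager, but verify against your cluster):

```scala
import org.apache.spark.sql.SparkSession

// Placeholder hosts: substitute the actual EC2 DNS names of your NameNode
// and ResourceManager; the spark.hadoop.* prefix forwards these values to
// the underlying Hadoop configuration.
val spark = SparkSession.builder()
  .appName("remote-yarn-client")
  .master("yarn")
  .config("spark.hadoop.fs.defaultFS", "hdfs://ec2-namenode-host:8020")
  .config("spark.hadoop.yarn.resourcemanager.address", "ec2-resourcemanager-host:8032")
  .getOrCreate()
```

Whether this works without a local HADOOP_CONF_DIR depends on your Spark version; the linked question covers the details.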

Alternatively, as mentioned in that post, the best approach is to take your cluster's XML files from HADOOP_CONF_DIR and copy them into your application's classpath. This is typically src/main/resources for a Java/Scala application.
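
With the XML files on the classpath, nothing cluster-specific needs to be hard-coded; a sketch of the same application under that assumption:

```scala
import org.apache.spark.sql.SparkSession

// core-site.xml, hdfs-site.xml, yarn-site.xml and hive-site.xml copied into
// src/main/resources are picked up from the classpath automatically.
val spark = SparkSession.builder()
  .appName("classpath-configured")
  .master("yarn")
  .enableHiveSupport()
  .getOrCreate()
```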

Not sure about Python, R, or the SSL configs.

And yes, you need to add a remote user account for your local Windows username on all the nodes. This is how user impersonation will be handled by Spark executors.
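
As a side note, separate from the account setup above: if the cluster runs with simple (non-Kerberos) authentication, a commonly used client-side workaround is to override the identity Hadoop sees via the `HADOOP_USER_NAME` property, so jobs are attributed to an existing cluster account rather than the local Windows username. The account name below is a placeholder:

```scala
// Must be set before the SparkSession / Hadoop client is created.
// "hadoop" is a placeholder for whatever account actually exists on the cluster.
System.setProperty("HADOOP_USER_NAME", "hadoop")
```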

OneCricketeer – 179,855 • 19 • 132 • 245