
I am a newbie to Hadoop, and to Linux as well. My professor asked us to separate the Hadoop client and the cluster using port mapping or a VPN. I don't understand the meaning of such a separation. Can anybody give me a hint?

Now I get the idea of client/cluster separation. I think it requires that Hadoop also be installed on the client machine, and that when the client submits a Hadoop job, it is submitted to the masters of the cluster.

And I have some naive ideas:

1. Create a client machine and install Hadoop on it.

2. Set fs.default.name to hdfs://master:9000 (a sketch of the client's core-site.xml is below).

3. Set dfs.namenode.name.dir to file://master/home/hduser/hadoop_tmp/hdfs/namenode. Is that correct?

4. Beyond that, I don't know how to set dfs.namenode.name.dir and the other configuration properties.

5. I think the main idea is to set the configuration files so that the job runs on the Hadoop cluster, but I don't know how to do it exactly.
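
For step 2, I imagine the client's core-site.xml would contain something like this (assuming the master's hostname is master and HDFS listens on port 9000, as above; please correct me if this is wrong):

<property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
</property>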

boygood

2 Answers


First of all, this link has detailed information on how the client communicates with the namenode:

http://www.informit.com/articles/article.aspx?p=2460260&seqNum=2

To my understanding, your professor wants a separate node to act as a client, from which you can run Hadoop jobs, but that node should not be part of the Hadoop cluster itself.

Consider a scenario where you have to submit a Hadoop job from a client machine that is not part of the existing Hadoop cluster, and the job is expected to be executed on that cluster.

The Namenode and Datanodes form the Hadoop cluster, and the client submits jobs to the Namenode. To achieve this, the client should have the same copy of the Hadoop distribution and configuration that is present on the Namenode. Only then will the client know which node the JobTracker is running on and the IP address of the Namenode for accessing HDFS data.

Go through the configuration on the Namenode.

core-site.xml will have this property:

<property>
        <name>fs.default.name</name>
        <value>hdfs://192.168.0.1:9000</value>
</property> 

mapred-site.xml will have this property:

<property>
      <name>mapred.job.tracker</name>
      <value>192.168.0.1:8021</value>
</property>

These two important properties must be copied to the client machine's Hadoop configuration. You also need to set one additional property in the mapred-site.xml file to avoid a Privileged Action Exception.

<property>
      <name>mapreduce.jobtracker.staging.root.dir</name>
      <value>/user</value>
</property>

Also, you need to update /etc/hosts on the client machine with the IP addresses and hostnames of the namenode and datanodes.
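
For example, assuming the Namenode is at 192.168.0.1 (as in the properties above) and one Datanode is at 192.168.0.2, with illustrative hostnames, /etc/hosts on the client could contain entries like:

192.168.0.1    namenode
192.168.0.2    datanode1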

Now you can submit jobs from the client machine with the hadoop jar command, and they will be executed on the Hadoop cluster. Note that you shouldn't start any Hadoop services on the client machine.
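
For example, to run the bundled WordCount example from the client (the exact jar name depends on your Hadoop version, and the HDFS input/output paths here are placeholders):

hadoop jar hadoop-examples.jar wordcount /user/hduser/input /user/hduser/output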

hadooper

Users shouldn't be able to disrupt the functionality of the cluster; that's the point. Imagine a whole bunch of data scientists who launch their jobs from one of the cluster's masters. If someone launches a memory-intensive operation, the master processes running on the same machine could run out of memory and crash, which would leave the whole cluster in a failed state.

If you separate client node from master/slave nodes, users could still crash the client, but the cluster would stay up.

facha
  • The easiest way is to copy the configs from a cluster node to the client exactly as they are. Normally the Hadoop configs are located in /etc/hadoop/conf; copy this directory from the master to the client machine (a sketch is below, after these comments). You should have the same version of Hadoop installed on the cluster and the client. – facha Feb 18 '16 at 15:00
  • Then what about the datanodes and namenodes? Will they also be on the client's hard disk? – boygood Feb 18 '16 at 15:05
  • You don't have to run any services on your client machines. By "having the same version of hadoop installed" I mean just have it all lying on your hard disk (binaries, libraries, etc.). – facha Feb 18 '16 at 15:40
  • Do I need to run start-all on the client machine? Do I need to add the client machine to the hosts file of the cluster machines? – boygood Feb 18 '16 at 16:41
  • no, you don't have to start any hadoop service. I'm not sure about the "hosts" file. Is it /etc/hosts? If yes, then yes, you should add your client's hostname to /etc/hosts on all cluster nodes. – facha Feb 18 '16 at 21:59
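
A minimal sketch of the copy step mentioned in the first comment, assuming the configs live in /etc/hadoop/conf on both machines, the master's hostname is master, the hduser account has SSH access, and the conf directory already exists on the client (all placeholders for your own setup):

scp -r hduser@master:/etc/hadoop/conf/* /etc/hadoop/conf/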