
Assume there is a Hadoop cluster with 20 machines. Of those 20 machines, 18 are slaves, machine 19 runs the NameNode, and machine 20 runs the JobTracker.

Now I know that the Hadoop software has to be installed on all 20 of those machines.

But my question is: which machine is used to load a file xyz.txt into the Hadoop cluster? Is that client machine a separate machine? Do we need to install the Hadoop software on that client machine as well? How does the client machine identify the Hadoop cluster?

Surender Raja

3 Answers


I am new to Hadoop, so this is what I understood:

If your data upload is not an actual service of the cluster (which would run on an edge node of the cluster), then you can configure your own computer to work as an edge node.

An edge node doesn't need to be known by the cluster (except for security purposes), as it neither stores data nor runs compute jobs. That is basically what it means to be an edge node: it is connected to the Hadoop cluster but does not participate in it.

In case it can help someone, here is what I have done to connect to a cluster that I don't administer:

  • get an account on the cluster, say myaccount
  • create an account on your computer with the same name: myaccount
  • configure your computer to access the cluster machines (SSH without a passphrase, registered IP, ...)
  • get the Hadoop configuration files from an edge node of the cluster
  • get a Hadoop distribution (e.g. an Apache Hadoop release)
  • uncompress it where you want, say /home/myaccount/hadoop-x.x
  • add the following environment variables: JAVA_HOME and HADOOP_HOME (/home/myaccount/hadoop-x.x)
  • (if you'd like) add the Hadoop bin directory to your path: export PATH=$HADOOP_HOME/bin:$PATH
  • replace your Hadoop configuration files with those you got from the edge node. With Hadoop 2.5.2, that is the folder $HADOOP_HOME/etc/hadoop
  • also, I had to change the value of a couple of $JAVA_HOME exports defined in the conf files. To find them, use: grep -r "export.*JAVA_HOME"

Then run hadoop fs -ls /, which should list the root directory of the cluster's HDFS.
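The environment part of the steps above can be sketched in shell. Every path here is a placeholder (my account name and Hadoop version), so substitute your own:

```shell
# Sketch of the environment setup from the steps above; all paths are
# hypothetical -- use your own account name, JDK location, and Hadoop version.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk     # your local JDK
export HADOOP_HOME=/home/myaccount/hadoop-2.5.2  # where you uncompressed the distribution
export PATH="$HADOOP_HOME/bin:$PATH"

# The client configuration files copied from the edge node go into
# $HADOOP_HOME/etc/hadoop/ (Hadoop 2.x layout): core-site.xml, hdfs-site.xml, ...
echo "PATH now starts with: ${PATH%%:*}"
```

With that in place, `hadoop fs -ls /` picks up the copied configuration and talks to the cluster's HDFS rather than a local one.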

Juh_

Typically, if you have a multi-tenant cluster (which most Hadoop clusters are bound to be), then ideally no one other than administrators has access to the machines that are part of the cluster.

Developers set up their own "edge nodes". Edge nodes basically have the Hadoop libraries and the client configuration deployed to them (the various XML files that tell the local installation where the NameNode, JobTracker, ZooKeeper, etc. are: core-site.xml, mapred-site.xml, hdfs-site.xml). But the edge node does not have any role as such in the cluster, i.e. no persistent Hadoop services are running on this node.
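To make that concrete, here is a hedged sketch of the single most important client-side setting, the one in core-site.xml that tells the local `hadoop` CLI where the cluster is. The host name `namenode-host` and port are placeholders; in practice you would copy the real file from an edge node rather than write it by hand:

```shell
# Hypothetical minimal client-side core-site.xml (Hadoop 2.x property name).
# "namenode-host:8020" is a placeholder for your cluster's actual NameNode.
mkdir -p /tmp/hadoop-client-conf
cat > /tmp/hadoop-client-conf/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:8020</value>
  </property>
</configuration>
EOF
grep '<value>' /tmp/hadoop-client-conf/core-site.xml
```

This is how the client "identifies" the cluster: not by being a member of it, but simply by having configuration that points at the right master addresses.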

Now, in the case of a small development-environment setup, you can use any one of the participating nodes of the cluster to run jobs or shell commands.

So, based on your requirements, the definition and placement of the client varies.

Venkat
  • What if a client is dedicated to running various hadoop jobs as well as uploading data to HDFS on demand, using Hive, Pig, Hadoop-GIS, etc. What needs to be done to install, for example, Hadoop-GIS for a user of the cluster, but not administrator? In that respect, what has to be on a client node? What has to be done on a master node for these tools running on a client node and using a cluster? – mel Nov 12 '15 at 14:38
  • It depends on the type of library you need to use. For example, Spark need not be installed anywhere on the cluster; it can be on your client node, and the job will be submitted to your YARN ResourceManager. I am not sure about Hadoop-GIS, but in most cases a client installation plus adding libraries to the distributed cache should be enough. – Venkat Nov 13 '15 at 18:23

I recommend this article. "Client machines have Hadoop installed with all the cluster settings, but are neither a Master or a Slave. Instead, the role of the Client machine is to load data into the cluster, submit Map Reduce jobs describing how that data should be processed, and then retrieve or view the results of the job when its finished."
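The client workflow the quote describes can be illustrated with a few commands. File, jar, and class names here are made up for the sketch, and it is guarded so it does nothing on a machine without the hadoop CLI:

```shell
# Illustrative client-machine workflow (hypothetical file/jar/class names);
# guarded so this is a harmless no-op where no hadoop CLI is installed.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -put xyz.txt /user/myaccount/                        # load data into the cluster
  hadoop jar wordcount.jar WordCount \
    /user/myaccount/xyz.txt /user/myaccount/out                  # submit a MapReduce job
  hadoop fs -cat /user/myaccount/out/part-r-00000                # retrieve/view the results
else
  echo "no hadoop CLI on this machine -- install one as described above"
fi
```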

mel