
I want to run a Hadoop job remotely from a Windows machine. The cluster is running on Ubuntu.

Basically, I want to do two things:

  1. Execute the Hadoop job remotely.
  2. Retrieve the result from the Hadoop output directory.

I don't have any idea how to achieve this. I am using Hadoop version 1.1.2.

I tried passing the JobTracker/NameNode URLs in the Job configuration, but it fails.
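
For context, what I was attempting looks roughly like the sketch below (the hostnames, ports, paths, and the identity Mapper/Reducer are placeholders, not my real job classes):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RemoteJobSubmit {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point the client at the remote cluster (placeholder hostnames/ports).
            conf.set("fs.default.name", "hdfs://namenode-host:9000");
            conf.set("mapred.job.tracker", "jobtracker-host:9001");

            Job job = new Job(conf, "remote-job");
            job.setJarByClass(RemoteJobSubmit.class);
            // Identity mapper/reducer stand in for the real job classes.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path("/user/me/input"));
            FileOutputFormat.setOutputPath(job, new Path("/user/me/output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }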

I have tried the following example: Running java hadoop job on local/remote cluster

Result: I consistently get a "cannot load directory" error. It is similar to this post: Exception while submitting a mapreduce job from remote system


1 Answer


Welcome to a world of pain. I've just implemented this exact use case, but using Hadoop 2.2 (the current stable release) patched and compiled from source.

What I did, in a nutshell, was:

  1. Download the Hadoop 2.2 source tarball to a Linux machine and decompress it to a temp dir.
  2. Apply these patches, which solve the problem of connecting from a Windows client to a Linux server.
  3. Build it from source, using these instructions. This will also ensure that you have 64-bit native libs if you have a 64-bit Linux server. Make sure you fix the build files as the post instructs, or the build will fail. Note that after installing protobuf 2.5, you have to run sudo ldconfig; see this post.
  4. Deploy the resulting dist tar from hadoop-2.2.0-src/hadoop-dist/target on the server node(s) and configure it. I can't help you with that since you need to tweak it to your cluster topology.
  5. Install Java on your client Windows machine. Make sure that the path to it has no spaces in it, e.g. c:\java\jdk1.7.
  6. Deploy the same Hadoop dist tar you built on your Windows client. It contains the crucial fix for the Windows/Linux connection problem.
  7. Compile winutils and the Windows native libraries as described in this Stack Overflow answer. It's simpler than building all of Hadoop on Windows.
  8. Set the JAVA_HOME, HADOOP_HOME and PATH environment variables as described in these instructions.
  9. Use a text editor or unix2dos (from Cygwin or standalone) to convert all .cmd files in the bin and etc\hadoop directories to Windows (CRLF) line endings, otherwise you'll get weird errors about labels when running them.
  10. Configure the connection properties to your cluster in your config XML files, namely fs.default.name, mapreduce.jobtracker.address, yarn.resourcemanager.hostname and the like (see the sketch after this list).
  11. Add the rest of the configuration required by the patches from item 2. This is required on the client side only; otherwise the patch won't work.
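
To make item 10 concrete, here is a minimal client-side sketch with those connection properties set programmatically. The hostnames and ports are placeholders for your own topology, and mapreduce.framework.name=yarn is my assumption for a YARN cluster; the same properties can equally live in the client's core-site.xml / mapred-site.xml / yarn-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ClusterConnectionCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Remote cluster endpoints (placeholder hostnames/ports).
            conf.set("fs.default.name", "hdfs://namenode-host:9000");
            conf.set("mapreduce.framework.name", "yarn");
            conf.set("yarn.resourcemanager.hostname", "resourcemanager-host");

            // Quick sanity check: list the HDFS root from the Windows client.
            FileSystem fs = FileSystem.get(conf);
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
            fs.close();
        }
    }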

If you've managed all of that, you can start your Linux Hadoop cluster and connect to it from your Windows command prompt. Joy!
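
As for the second half of the question, pulling the results back from the output directory: once the client is wired up as above, the HDFS FileSystem API works from the Windows side as well. A rough sketch, with placeholder paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FetchJobOutput {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode-host:9000"); // placeholder

            FileSystem fs = FileSystem.get(conf);
            Path outputDir = new Path("/user/me/output"); // placeholder job output dir
            for (FileStatus status : fs.listStatus(outputDir)) {
                // Copy each reducer output file down to the local Windows disk.
                if (status.getPath().getName().startsWith("part-")) {
                    fs.copyToLocalFile(status.getPath(),
                            new Path("C:/hadoop-results/" + status.getPath().getName()));
                }
            }
            fs.close();
        }
    }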
