Welcome to a world of pain. I've just implemented this exact use case, but using Hadoop 2.2 (the current stable release) patched and compiled from source.
What I did, in a nutshell, was:
- Download the Hadoop 2.2 source tarball to a Linux machine and extract it to a temp dir.
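  A minimal sketch of this step (the URL follows the Apache release archive layout for 2.2.0; verify it before use):

  ```sh
  # fetch and unpack the Hadoop 2.2.0 source tree into a temp dir
  wget https://archive.apache.org/dist/hadoop/common/hadoop-2.2.0/hadoop-2.2.0-src.tar.gz
  tar -xzf hadoop-2.2.0-src.tar.gz -C /tmp
  ```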
- Apply these patches which solve the problem of connecting from a Windows client to a Linux server.
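  Hadoop patches are applied from the root of the extracted source tree; the file name below is purely a placeholder for whichever patch files the links refer to:

  ```sh
  cd /tmp/hadoop-2.2.0-src
  # HADOOP-XXXX.patch is hypothetical -- substitute the actual patch file(s)
  patch -p0 < HADOOP-XXXX.patch
  ```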
- Build it from source, using these instructions. This will also ensure that you have 64-bit native libs if you have a 64-bit Linux server. Make sure you fix the build files as the post instructs, or the build will fail. Note that after installing protobuf 2.5 you have to run `sudo ldconfig`; see this post.
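  For reference, the full-dist build command documented in the source tree's BUILDING.txt is the following, run from the source root once Maven, protobuf 2.5 and the native toolchain are installed:

  ```sh
  # builds the distribution tarball, including the native libraries
  mvn package -Pdist,native -DskipTests -Dtar
  ```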
- Deploy the resulting dist tar from `hadoop-2.2.0-src/hadoop-dist/target` on the server node(s) and configure it. I can't help you with that, since you need to tweak it to your cluster topology.
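  A sketch of the deployment, with `namenode` and `/opt` as placeholders for your own host and install location:

  ```sh
  # copy the freshly built tarball to a server node and unpack it there
  scp hadoop-dist/target/hadoop-2.2.0.tar.gz user@namenode:/tmp/
  ssh user@namenode 'tar -xzf /tmp/hadoop-2.2.0.tar.gz -C /opt'
  ```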
- Install Java on your client Windows machine. Make sure that the path to it has no spaces in it, e.g. `c:\java\jdk1.7`.
- Deploy the same Hadoop dist tar you built on your Windows client. It contains the crucial fix for the Windows/Linux connection problem.
- Compile winutils and the Windows native libraries as described in this Stackoverflow answer. It's simpler than building the whole of Hadoop on Windows.
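  Roughly, from a Windows SDK 7.1 command prompt inside the source tree (the solution paths below are from the 2.2 source layout; adjust if your tree differs):

  ```bat
  msbuild hadoop-common-project\hadoop-common\src\main\winutils\winutils.sln /p:Configuration=Release /p:Platform=x64
  msbuild hadoop-common-project\hadoop-common\src\main\native\native.sln /p:Configuration=Release /p:Platform=x64
  :: then copy the resulting winutils.exe and hadoop.dll into %HADOOP_HOME%\bin
  ```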
- Set up the `JAVA_HOME`, `HADOOP_HOME` and `PATH` environment variables as described in these instructions.
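  For the current command prompt this looks like the following (use the System Properties dialog or `setx` to make it persistent; the paths are examples, adjust to yours):

  ```bat
  set JAVA_HOME=c:\java\jdk1.7
  set HADOOP_HOME=c:\hadoop\hadoop-2.2.0
  set PATH=%PATH%;%JAVA_HOME%\bin;%HADOOP_HOME%\bin
  ```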
- Use a text editor or `unix2dos` (from Cygwin or standalone) to convert all `.cmd` files in the `bin` and `etc\hadoop` directories to Windows (CRLF) line endings, otherwise you'll get weird errors about labels when running them.
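  With `unix2dos` this is a one-liner from a Cygwin shell in `%HADOOP_HOME%` (the files are rewritten in place):

  ```sh
  unix2dos bin/*.cmd etc/hadoop/*.cmd
  ```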
- Configure the connection properties for your cluster in your client-side config XML files, namely `fs.default.name`, `mapreduce.jobtracker.address`, `yarn.resourcemanager.hostname` and the like.
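  For example, in the client's `core-site.xml` and `yarn-site.xml` (host names and port are placeholders; note that `fs.default.name` is deprecated in 2.x in favor of `fs.defaultFS`, but both still work):

  ```xml
  <!-- core-site.xml -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:8020</value>
  </property>

  <!-- yarn-site.xml -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager-host</value>
  </property>
  ```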
- Add the rest of the configuration required by the patches from item 2. This is needed on the client side only; without it, the patch won't work.
If you've managed all of that, you can start your Linux Hadoop cluster and connect to it from your Windows command prompt. Joy!
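A quick smoke test from the Windows prompt, assuming the cluster is up and your config points at it:

```sh
hadoop fs -ls /
```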