
I am trying to set up a fully-distributed Hadoop/MapReduce instance where each node will run a series of C++ Hadoop Streaming tasks on some input. However, I don't want to move all the input data onto HDFS - instead I want to see if there is a way to read the input from a local folder on each node.

Is there any way to do this?

EDIT: An example of the Hadoop command I would like to run is similar to:

hadoop jar $HADOOP_STREAM/hadoop-streaming-0.20.203.0.jar \
            -mapper map_example \
            -input file:///data/ \
            -output /output/ \
            -reducer reducer_example \
            -file map_example \
            -file reducer_example 

In this case, the data stored on each of my nodes is in the /data/ directory, and I want the output to go to the /output/ directory of each individual node. The map_example and reducer_example files are available locally on all nodes.

How would I implement a Hadoop command such that, when it is run on the master node, all of the slave nodes essentially run the same task, each producing a local output file from its local input files?

Thanks

Ken

2 Answers


As noted in this question, this appears to be possible. Though I have not tested it, it appears you can set fs.default.name in conf/core-site.xml to refer to a file URL instead of an HDFS URL.
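
Untested, but a minimal sketch of what that setting might look like in conf/core-site.xml (the file:/// value points the default filesystem at the local filesystem instead of HDFS):

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>file:///</value>
      </property>
    </configuration>

With that in place, un-prefixed paths such as /data/ should resolve against each node's local filesystem rather than HDFS.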

– Emil Sit
  • Hi Emil, Thanks for those resources. You may not be able to answer this question, but do you have any idea how exactly Hadoop/MapReduce should be set up so I can have a fully-distributed instance running individual Hadoop streaming tasks based on local input files? I've updated my question with a more detailed description. Thanks! – Ken Nov 21 '11 at 08:29
  • 1
    I don't know. It does sound a little bit like you don't actually want MapReduce, just distributed execution. For that, a tool like [fabric](http://fabfile.org) might be more appropriate. – Emil Sit Nov 23 '11 at 00:41
  • I've decided to change my approach slightly: I now want to run a fully-distributed MapReduce job (e.g. with the output from all nodes combined into one file on HDFS), with the exception that identical copies of my input files are available locally on each node rather than uploaded onto HDFS. I have not found a way to do this successfully yet. – Ken Nov 23 '11 at 18:00

This is not exactly a Hadoop solution, but you could write a program (say, in Python) that forks multiple processes, each of which ssh-es into one of the slave machines and runs the map/reduce code there.

hadoop dfsadmin -report lets you list the IPs in the cluster. You can have each process ssh into one of those IPs and run the mapper and reducer.

MapReduce on *nix can be emulated with pipes:

cat <input> | ./mapper | sort | ./reducer > <output_location>
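
For concreteness, here is a minimal sketch of the ssh fan-out described above. The IP list, paths, and binary names (/data/, /output/, map_example, reducer_example) are placeholders; it assumes passwordless ssh to each slave and that the binaries already exist on every node. The IPs could, for example, be extracted by parsing the output of hadoop dfsadmin -report.

    #!/usr/bin/env python
    # Hypothetical sketch: run the local map/sort/reduce pipeline on each slave via ssh.
    import subprocess

    # Placeholder list of slave IPs (e.g. parsed from `hadoop dfsadmin -report`).
    slave_ips = ["10.0.0.1", "10.0.0.2"]

    # The same shell pipeline is executed on every node against its local /data/.
    pipeline = "cat /data/* | ./map_example | sort | ./reducer_example > /output/result"

    # Launch one ssh process per slave in parallel, then wait for all of them to finish.
    procs = [subprocess.Popen(["ssh", ip, pipeline]) for ip in slave_ips]
    for p in procs:
        p.wait()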

– viper