
We have a particular algorithm that we want to integrate with HDFS. The algorithm requires local data access (the work would be done exclusively in the Mapper). However, we do want to take advantage of HDFS for distributing the file (providing reliability and striping). After the calculation is performed, we'd use the Reducer simply to send back the answer, rather than perform any additional work. Avoiding network use is an explicit goal. Is there a configuration setting that would allow us to restrict network data access, so that when a MapReduce job is started it will only access its local DataNode?

UPDATE: Adding a bit of context

We're attempting to analyze this problem with string matching. Assume our cluster has N nodes and a file is stored with N GB of text. The file is stored in HDFS and distributed in even parts to the nodes (1 part per node). Can we create a MapReduce job that launches one process on each node, with each process accessing only the part of the file that's sitting on the same host? Or would the MapReduce framework distribute the work unevenly (e.g. 1 job accessing all N parts of the data, or 0.5N nodes attempting to process the whole file)?

blong
  • By default, MR jobs always consume all local data from HDFS first. Only if no more local data is available are remote blocks transferred over the network (and this is reasonable -- why should a worker wait for the other mappers to finish if there is some data left?). Is this not sufficient for your case? – Matthias J. Sax Aug 03 '15 at 14:33
  • Pardon my ignorance; I think that because I don't fully understand the standard MR details, my search for the answer has been more difficult. In your terms, "_By default, MR jobs always consume all local data from HDFS first._". What I've been asked to find is whether we can create a job that "_consumes all *local data* from HDFS *only*_". I might have been wrong to assume our algorithm would run in a `Mapper`. Perhaps the `Mapper` would be an identity function, while the `Reducer` would perform the actual calculation. Perhaps I need to ask about job control or sorting control? – blong Aug 03 '15 at 14:44
  • I understand the question. I am not aware that you can do it. And I don't see any advantage in doing it. It might increase your execution time... – Matthias J. Sax Aug 03 '15 at 14:47
  • Ah, ok, sorry. I think you're right to wonder if this would increase our execution time. We share that concern, but we want to determine whether, through particular configuration, we can actually decrease the execution time, since in our case network latency is a bottleneck. – blong Aug 03 '15 at 14:51

2 Answers


If you set the number of reduce tasks to zero, you can skip the shuffle phase and therefore the network cost of your algorithm.

While creating your job, this can be done with the following line of code:

job.setNumReduceTasks(0);

I don't know what your algorithm will do, but say it is a pattern-matching algorithm looking for occurrences of a particular word; then the mappers would report the number of matches per split. If you want to add up the counts, you need network communication and a reducer.
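To make the word-matching case concrete, here is a minimal standalone sketch of the per-record counting such a map-only job would perform; the class name `MatchCounter` and the search term are illustrative, and in a real job this logic would live inside `Mapper.map()`, with the (count per split) written out directly since there is no reducer:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Core of a hypothetical map-only matcher: count occurrences of a search
// term in one input line -- what each Mapper.map() call would do per record.
public class MatchCounter {
    static long countMatches(String line, String term) {
        // Pattern.quote treats the term as a literal, not a regex.
        Matcher m = Pattern.compile(Pattern.quote(term)).matcher(line);
        long count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("to be or not to be", "be")); // prints 2
    }
}
```

With `setNumReduceTasks(0)`, each mapper's output (here, its match count) is written straight to HDFS as one part file per map task, with no shuffle over the network.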

First Google match on a map-only example I found: Map-Only MR jobs

DDW
  • Thanks @DDW! Your answer led me to even more information about this strategy, namely these two SO threads: [How to write 'map only' hadoop jobs?](http://stackoverflow.com/q/9394409/320399) and [hadoop: difference between 0 reducer and identity reducer?](http://stackoverflow.com/q/10630447/320399). I'd like to do a little more research on this before marking your answer as correct, but I will come back and mark the proper answer. – blong Aug 03 '15 at 18:05
  • @blong: When choosing between 'map only' and 'identity reducer' remember the effect on the number of output files (parts) of your job. This number may have a significant effect on the job you run after this job. – Niels Basjes Aug 04 '15 at 05:43

Setting the number of reducers to zero would increase data locality, since the intermediate data generated by the Mappers is written directly to HDFS. Of course, you have no control over which nodes will store that intermediate data, and if its size is greater than the number of mapper slots * block size, remote access will be attempted to avoid starvation. My advice is to use the delay scheduler and set locality-delay-node-ms and locality-delay-rack-ms to a large value (i.e. the maximum expected running time of your mappers). This will make the delay scheduler wait as long as possible before requesting data remotely. However, this may lead to resource under-utilization and increase the running time (e.g. any node that does not store any block of the data will be idle for a long time: locality-delay-node-ms + locality-delay-rack-ms).
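A sketch of how those delay-scheduling knobs could be set in `yarn-site.xml`, assuming the Fair Scheduler; the fully qualified property names below follow the answer's naming as exposed in some Hadoop 2.x releases, so verify them against your version's scheduler documentation before relying on them:

```xml
<!-- Make the scheduler wait (here up to 10 minutes each) for a node-local,
     then a rack-local, container before scheduling off-rack.
     Property names vary across Hadoop versions -- check your release. -->
<property>
  <name>yarn.scheduler.fair.locality-delay-node-ms</name>
  <value>600000</value>
</property>
<property>
  <name>yarn.scheduler.fair.locality-delay-rack-ms</name>
  <value>600000</value>
</property>
```

Setting both values near the maximum expected mapper runtime maximizes locality, at the cost of leaving nodes without local blocks idle for up to the sum of the two delays.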

Yahia