We have a particular algorithm that we want to integrate with HDFS. The algorithm requires us to access data locally (the work would be done exclusively in the Mapper). However, we do want to take advantage of HDFS for distributing the file (providing reliability and striping). After the calculation is performed, we'd use the Reducer simply to send back the answer, rather than to perform any additional work. Avoiding network use is an explicit goal. Is there a configuration setting that would allow us to restrict network data access, so that when a MapReduce job is started it will only access its local DataNode?
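For reference, here is roughly what the job driver would look like (a minimal sketch, not our real code; MatchMapper, AnswerReducer, and the job name are placeholders):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalOnlyJob {

    // All of the real work happens in the map phase; ideally each map task
    // only ever reads the block stored on its local DataNode.
    public static class MatchMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // ... run the string-matching algorithm over this split ...
        }
    }

    // The reducer does no further computation; it just forwards the answers.
    public static class AnswerReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,
                Context context) throws IOException, InterruptedException {
            for (LongWritable v : values) {
                context.write(key, v);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Is there a property we could set here that forbids non-local reads,
        // i.e. makes the scheduler assign only node-local map tasks?
        Job job = Job.getInstance(conf, "local-only string matching");
        job.setJarByClass(LocalOnlyJob.class);
        job.setMapperClass(MatchMapper.class);
        job.setReducerClass(AnswerReducer.class);
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```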
UPDATE: Adding a bit of context
We're attempting to analyze this problem with string matching. Assume our cluster has N nodes and a file containing N GB of text. The file is stored in HDFS and distributed in even parts across the nodes (one part per node). Can we create a MapReduce job that launches one process on each node, where each process accesses the part of the file sitting on its own host? Or would the MapReduce framework distribute the work unevenly (e.g. one task reading all N parts of the data, or 0.5N nodes attempting to process the whole file)?
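The closest thing we've found so far is to line the splits up with the blocks, so that there are exactly N map tasks for the scheduler to place data-locally. A sketch of what we mean, assuming Hadoop 2.x property names (the 1 GB block size is just illustrative of our N GB / N nodes example):

```java
import org.apache.hadoop.conf.Configuration;

public class SplitSizing {
    // Pin the MapReduce split size to the HDFS block size, so each of the
    // N blocks yields exactly one map task that the scheduler can try to
    // place on the node holding that block (locality is best effort only,
    // as far as we can tell -- hence the question above).
    public static Configuration oneSplitPerBlock(long blockSizeBytes) {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", blockSizeBytes);  // applies when writing the file
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", blockSizeBytes);
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", blockSizeBytes);
        return conf;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024L * 1024L;  // 1 GB per node in our example
        Configuration conf = oneSplitPerBlock(oneGb);
        System.out.println(conf.get("mapreduce.input.fileinputformat.split.maxsize"));
    }
}
```

Would this guarantee one map task per node, or can the framework still assign several of those splits to the same node?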