When running different MapReduce applications (for example, sort vs. word count), I noticed that the data locality rate can differ.

How does the locality rate depend on the application and its input files? Why do some applications achieve a higher locality rate than others?


1 Answer


Have a look at the YARN tutorial.

When YARN selects a NodeManager based on resource availability, and that NodeManager is on a different node from the DataNode where the data is stored, the data locality concept is broken. In that case the data is transferred from one node to another over the network. This typically happens when the NodeManagers on the data-local nodes are busy, i.e. constrained by their CPU and memory capacity.
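You can measure the locality rate of a finished job from its built-in counters. Here is a minimal sketch, assuming a standard MapReduce client where the job has already completed; DATA_LOCAL_MAPS, RACK_LOCAL_MAPS and TOTAL_LAUNCHED_MAPS are from org.apache.hadoop.mapreduce.JobCounter:

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobCounter;

    public class LocalityRate {
        // Print the fraction of map tasks that ran on the node holding
        // their data. Assumes "job" has already completed, e.g. after
        // job.waitForCompletion(true).
        public static void printLocality(Job job) throws Exception {
            Counters counters = job.getCounters();
            long dataLocal = counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
            long rackLocal = counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();
            long totalMaps = counters.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();
            System.out.printf("node-local: %.1f%%, rack-local: %.1f%%%n",
                    100.0 * dataLocal / totalMaps,
                    100.0 * rackLocal / totalMaps);
        }
    }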

Assume a particular NodeManager has been chosen to run a map task. If the data for its split spans multiple nodes, part of that data must be transferred between nodes.

The MapReduce framework uses logical input splits rather than physical block boundaries. Where a split ends depends on where records are written.

Assume the DFS block size is 64 MB and the last record of a block does not fit completely within it. Say the first half of that record (1 MB) is written to the block on data node 1 and the remaining half (1 MB) lands in the next 64 MB block on data node 2.

While the map task processes that logical split, the 1 MB from data node 2 is transferred over the network so the split can be read in full.
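For reference, the split size itself is the block size clamped between the configured minimum and maximum; this is the rule implemented by computeSplitSize in Hadoop's FileInputFormat, controlled by mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize:

    // Sketch of FileInputFormat's split-size rule: the split size is the
    // DFS block size, clamped between the configured min and max split
    // sizes. With defaults, splitSize == blockSize, so one map per block.
    long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }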

Have a look at my post on another SE question, which explains the input split process in detail.

From the Apache MapReduce tutorial:

How Many Maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
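As a worked example (the figures are the tutorial's own): with 10 TB of input and a 128 MB block size, you end up with roughly 10 TB / 128 MB ≈ 82,000 maps.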

How Many Reducers?

Reducer reduces a set of intermediate values which share a key to a smaller set of values. The number of reduces for the job is set by the user via Job.setNumReduceTasks(int).
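A minimal sketch of where that setting goes in a driver class; the job name and the reducer count of 4 are arbitrary placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            // The number of map tasks falls out of the input splits;
            // the number of reduce tasks is whatever the user sets here
            // (4 is an arbitrary choice for this sketch).
            job.setNumReduceTasks(4);
        }
    }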

Assume the mappers' output is ready. If a mapper's node and a reducer's node differ, the intermediate data is transferred over the network during the shuffle. The framework derives the number of mappers from the input splits, while the number of reducers is chosen by the user, as noted above.
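In the same spirit as the locality counters above, the shuffle volume can also be read from the counters after the job finishes; REDUCE_SHUFFLE_BYTES is from org.apache.hadoop.mapreduce.TaskCounter:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.TaskCounter;

    public class ShuffleBytes {
        // Total bytes the reducers fetched from mapper nodes during the
        // shuffle. Assumes "job" has already completed. A large value
        // relative to the input size means a lot of map output crossed
        // the network.
        public static long shuffledBytes(Job job) throws Exception {
            return job.getCounters()
                      .findCounter(TaskCounter.REDUCE_SHUFFLE_BYTES)
                      .getValue();
        }
    }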
