hadoop: tasks not local with file?

Question

I ran a Hadoop job, and when I look at some of the map tasks I see they are not running on the nodes where the file's blocks are stored. E.g., the map task runs on slave1, but all of the file's blocks are on slave2. The files are all gzip-compressed.

Why is this happening, and how can I resolve it?

UPDATE: note there are many pending tasks, so this is not a case of a node being idle and therefore hosting tasks that read from other nodes.

what's your replication factor? 3? Sometimes when you have a low rep factor this happens. — Donald Miner, Dec 19 '13 at 14:58
it is 1. this is a research cluster, not production. but why should this happen? hadoop should still pick the local node to run the task on. — IttayD, Dec 19 '13 at 15:02

cabad · Accepted Answer · 2013-12-19T19:19:57.377

Hadoop's default (FIFO) scheduler works like this: when a node has spare capacity, it contacts the master and asks for more work. The master tries to assign a data-local task, or a rack-local task, but if it can't, it will assign any task from the queue of waiting tasks to that node. However, while the first node is being assigned this non-local task (we'll call it task X), another node may also have spare capacity and contact the master asking for work. Even if this second node holds a local copy of the data required by X, it will not be assigned X, because the first node acquired the lock on the master slightly sooner. This results in poor data locality but fast task assignment.
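The greedy behavior described above can be illustrated with a toy simulation. This is only a sketch, not Hadoop's actual scheduler code: the function name `greedy_assign`, the task tuples, and the heartbeat order are all made up for illustration, and rack locality is ignored.

```python
from collections import deque

def greedy_assign(tasks, heartbeats):
    """Toy model of greedy assignment (hypothetical, not Hadoop code).

    tasks: list of (task_id, set of nodes holding the task's block).
    heartbeats: ordered list of node names, each offering one free slot.
    Returns a list of (node, task_id, is_local) tuples.
    """
    pending = deque(tasks)
    assignments = []
    for node in heartbeats:
        if not pending:
            break
        # Prefer a pending task whose block lives on the requesting node.
        local = next((t for t in pending if node in t[1]), None)
        # Otherwise hand out the head of the queue right away.
        chosen = local if local is not None else pending[0]
        pending.remove(chosen)
        assignments.append((node, chosen[0], local is not None))
    return assignments

# All four blocks sit on slave2 (replication factor 1), as in the question.
tasks = [(f"t{i}", {"slave2"}) for i in range(4)]
heartbeats = ["slave1", "slave2", "slave1", "slave2"]
result = greedy_assign(tasks, heartbeats)
local_fraction = sum(is_local for _, _, is_local in result) / len(result)
print(result)
print(local_fraction)
```

In this setup slave1 asks first, has no local data, and is immediately handed a task whose block is on slave2, so only half of the tasks end up data-local even though slave2 could have run all of them.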

In contrast, the Fair Scheduler uses a technique called delayed scheduling that achieves higher locality by delaying non-local task assignment for a "little bit" (configurable). It achieves higher locality but at a small cost of delaying some tasks.
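A minimal single-job sketch of the delay idea, assuming a skip counter as the "little bit" of waiting. The names `delay_assign` and `max_skips` are hypothetical, and the real Fair Scheduler handles multiple jobs, rack locality, and time-based delays that are left out here.

```python
def delay_assign(tasks, heartbeats, max_skips):
    """Toy model of delay scheduling for one job (hypothetical sketch).

    tasks: list of (task_id, set of nodes holding the task's block).
    heartbeats: ordered list of node names, each offering one free slot.
    max_skips: how many non-local offers the job declines before it
               accepts a non-local slot.
    """
    pending = list(tasks)
    skipped = 0
    assignments = []
    for node in heartbeats:
        if not pending:
            break
        local = next((t for t in pending if node in t[1]), None)
        if local is not None:
            # A local task exists for this node: launch it, reset the counter.
            pending.remove(local)
            assignments.append((node, local[0], True))
            skipped = 0
        elif skipped >= max_skips:
            # Waited long enough: accept a non-local slot.
            chosen = pending.pop(0)
            assignments.append((node, chosen[0], False))
        else:
            # Decline this slot and wait for a better offer.
            skipped += 1
    return assignments

tasks = [(f"t{i}", {"slave2"}) for i in range(4)]
heartbeats = ["slave1", "slave2"] * 4

patient = delay_assign(tasks, heartbeats, max_skips=1)
greedy = delay_assign(tasks, heartbeats, max_skips=0)
print(sum(is_local for _, _, is_local in patient), "of", len(patient), "local")
print(sum(is_local for _, _, is_local in greedy), "of", len(greedy), "local")
```

With `max_skips=1` the job declines each offer from slave1 and all four tasks run locally on slave2, at the cost of taking more heartbeats to launch them. With `max_skips=0` the sketch degenerates to the greedy behavior.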

Other people are working on better schedulers, so this will likely improve in the future. For now, you can choose to use the Fair Scheduler if you wish to achieve higher data locality.

I disagree with @donald-miner's conclusion that "With a default replication factor of 3, you don't see very many tasks that are not data local." He is correct in noting that more replicas will improve your locality percentage, but the percentage of data-local tasks may still be very low. I've also run experiments myself and saw very low data locality with the FIFO scheduler. You could achieve high locality if your job is large (has many tasks), but the more common, smaller jobs suffer from a problem called "head-of-line scheduling". Quoting from this paper:

The first locality problem occurs in small jobs (jobs that have small input files and hence have a small number of data blocks to read). The problem is that whenever a job reaches the head of the sorted list [...] (i.e. has the fewest running tasks), one of its tasks is launched on the next slot that becomes free, no matter which node this slot is on. If the head-of-line job is small, it is unlikely to have data on the node that is given to it. For example, a job with data on 10% of nodes will only achieve 10% locality.

That paper goes on to cite numbers from a large production cluster at Facebook, where they observed just 5% data locality.
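As a rough sanity check on the "10% of nodes gives 10% locality" arithmetic in the quote, here is a back-of-the-envelope sketch. It assumes the next free slot is equally likely to be on any node and that each block's replicas sit on distinct nodes; both are simplifying assumptions made here, and `expected_locality` is a hypothetical helper.

```python
def expected_locality(num_nodes, replicas):
    """Chance that a uniformly random node holds a replica of a given block.

    Assumes replicas are placed on distinct nodes and the free slot is
    equally likely to be on any node (simplifying assumptions).
    """
    return min(replicas, num_nodes) / num_nodes

print(expected_locality(10, 1))  # block on 1 of 10 nodes
print(expected_locality(4, 1))   # 4 machines, replication factor 1
print(expected_locality(4, 4))   # 4 machines, replication factor 4
```

Under these assumptions a 4-node cluster with a replication factor of 1 would see about 25% locality from a scheduler that takes the first free slot, while replicating every block to all 4 nodes makes any slot local.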

Final note: Should you care if you have low data locality? Not too much. The running time of your jobs may be dominated by the stragglers (tasks that take longer to complete) and the shuffle phase, so improving data locality would yield only a very modest improvement in running time (if any at all).

good answer. i learned a few things here... didn't realize the huge difference between FIFO and Fair. I agree that this is probably a bigger factor than # of replicas. I have personally noticed worse data locality in Fair with 1 replica vs. 3. — Donald Miner, Dec 19 '13 at 16:23
note that the paper says there is a 7% improvement in locality, not 7% data locality. Very different. — Donald Miner, Dec 19 '13 at 16:24
@donald-miner Thanks, I hadn't noticed that. I'll update my answer. — cabad, Dec 19 '13 at 16:32
I wonder if OP is using FIFO scheduler then? On my large multi-rack clusters using Fair Scheduler and 3 replicas, I'd say I get 97%+ node local and ~3% rack local. Can't say the last time I've seen something out of a rack. — Donald Miner, Dec 20 '13 at 02:46
I also agree with your assertion that data locality really isn't that big of a deal. Especially with 10GigE on a smaller cluster. — Donald Miner, Dec 20 '13 at 02:47
I have 4 machines and I have set replication factor of 4 on all my input files and switched to the Fair scheduler. When I look at a task running on node X, I see the input split locations are on all 4 machines. Does that mean the task will read from the location local to it? — IttayD, Dec 20 '13 at 05:22
@ittayd If you have 4 machines and 4 replicas, you should get 100% data locality. However, the Fair Scheduler achieves high locality even when # replicas < # machines (due to the Delay Scheduling technique). — cabad, Dec 20 '13 at 16:58
@donald-miner Yes, it appears OP was using the default (FIFO) scheduler. — cabad, Dec 20 '13 at 16:59

score 1 · Answer 2 · answered Dec 19 '13 at 15:56

Unfortunately, the default scheduler isn't that smart. I'm not sure exactly what's going on, but I think it's using some sort of greedy-style scheduling where it tries to schedule what it can now for the next task, and then moves on. There could definitely be improvements made to the Hadoop scheduler, and there have been a few academic attempts at making Hadoop scheduling more optimal.

This research paper shows that the default Hadoop scheduler is not optimal. In the results, they show that increasing the replication factor to three improves data locality significantly, with diminishing returns after that.

So, why hasn't the default scheduler been improved? Here is my opinion/theory: With a default replication factor of 3, you don't see very many tasks that are not data local. By having more replicas, you give the scheduler more flexibility to fit tasks in the right spots. Basically, it's a coincidence that you have 3 replicas, and the default scheduler takes advantage of that by being implemented in a lazy manner. Since you typically have 3 replicas for redundancy's sake already, there isn't much motivation to help scheduler performance for people with a replication factor of 1.

If you have the space, I suggest just upping the replication factor to two or three. There really isn't much downside.
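For files that already exist in HDFS, the replication factor can be changed with `hadoop fs -setrep`. The sketch below only builds the command line; the helper name `setrep_command` and the path `/user/me/input` are placeholders, not something from the question.

```python
def setrep_command(path, replicas, wait=True):
    """Build the argument list for `hadoop fs -setrep` (hypothetical helper).

    wait=True adds -w, which makes the command block until the
    target replication has been reached.
    """
    cmd = ["hadoop", "fs", "-setrep"]
    if wait:
        cmd.append("-w")
    cmd += [str(replicas), path]
    return cmd

# Placeholder path; substitute your own input directory.
cmd = setrep_command("/user/me/input", 3)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the `hadoop` client is installed and configured.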
