I have a YARN cluster with dozens of nodes. My program is a map-only job. Its Avro input is very small in size but contains several million rows, and processing a single row requires a lot of CPU power. What I observe is that many map tasks run on a single node, while other nodes do not participate at all. This makes some nodes very slow and affects overall HDFS performance. I assume this behaviour is caused by Hadoop data locality.
Is it possible to switch data locality off, or is there another way to force YARN to distribute map tasks more uniformly across the cluster?
Thanks!