
I have 7 very big gzip files, each about 10 GB, and 100 small bzip2 files of about 10 MB each. My Hadoop cluster has 10 machines, each with 8 cores. When I kick off the MapReduce job, the 100 small bzip2 files finish within a minute, while the 7 big gz files take a very long time.

My question is: why do the 7 gz files all go to a single machine even though there are 10 machines in the cluster? That one machine works very hard while the other 9 do almost nothing. Out of curiosity I set mapred.tasktracker.map.tasks.maximum=1, which should mean only one map task runs on a machine at a time, but even after that setting I still saw the 7 files being processed on one machine, i.e. 7 mappers (JVMs) running there simultaneously.

Please help me fan the 7 mappers out to 7 machines rather than one. Thanks in advance!

Jack

1 Answer


Maybe the blocks of those files are on an unbalanced HDFS, or were all written to a single DataNode? Hadoop tries to schedule each map task on a machine that holds the task's input block, so if all the blocks live on one node, all the mappers will run there. You may need to run the HDFS balancer (hadoop balancer) to spread the blocks over the cluster.
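If that's the case, it's easy to confirm by printing where HDFS actually placed the blocks of those gz files. Here is a minimal sketch using the Hadoop FileSystem API; /input/big is a placeholder path, so point it at whatever directory holds the 7 files:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhereAreMyBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // "/input/big" is a placeholder -- substitute the dir with the gz files.
            for (FileStatus file : fs.listStatus(new Path("/input/big"))) {
                System.out.println(file.getPath());
                for (BlockLocation block :
                        fs.getFileBlockLocations(file, 0, file.getLen())) {
                    // If the same one or two hosts show up for every block,
                    // the data (and hence the mappers) will pile onto them.
                    System.out.println("  offset " + block.getOffset()
                            + " -> " + Arrays.toString(block.getHosts()));
                }
            }
        }
    }

If every block of every file reports the same host, that machine is where all the mappers will be scheduled, and rebalancing should help. Also note that if the files were uploaded from a machine that is itself a DataNode, HDFS places the first replica of every block on that local node, which produces exactly this pattern.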

samthebest