
I have 7 very big gzip files, each about 10 GB, and 100 small bzip2 files of about 10 MB each. My Hadoop cluster has 10 machines, each with 8 cores. When I kick off the MapReduce job, the 100 small bzip2 files finish within a minute, while the 7 big gz files take a very long time.

My question is: why do the 7 gz files all go to a single machine even though there are 10 machines in the cluster? That one machine works very hard while the other 9 do almost nothing. Out of curiosity I set mapred.tasktracker.map.tasks.maximum=1, which should mean only one map task runs on a machine at a time, but even after that setting I still saw the 7 files being processed on one machine, i.e. 7 mappers (JVMs) running there simultaneously.

Please help me fan the 7 mappers out to 7 machines rather than one. Thanks in advance!

Jack

1 Answer


Maybe the blocks of those files are on an unbalanced HDFS, or were all written to a single DataNode? Hadoop tries to schedule each map task on a machine that holds the task's input block, so if all the blocks live on one node, all the mappers will run there. You may need to run the HDFS balancer (hadoop balancer) to spread the blocks over the cluster.
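If that's the case, it's easy to confirm by printing where HDFS actually placed the blocks of those gz files. Here is a minimal sketch using the Hadoop FileSystem API; /input/big is a placeholder path, so point it at whatever directory holds the 7 files:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhereAreMyBlocks {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // "/input/big" is a placeholder -- substitute the dir with the gz files.
            for (FileStatus file : fs.listStatus(new Path("/input/big"))) {
                System.out.println(file.getPath());
                for (BlockLocation block :
                        fs.getFileBlockLocations(file, 0, file.getLen())) {
                    // If the same one or two hosts show up for every block,
                    // the data (and hence the mappers) will pile onto them.
                    System.out.println("  offset " + block.getOffset()
                            + " -> " + Arrays.toString(block.getHosts()));
                }
            }
        }
    }

If every block of every file reports the same host, that machine is where all the mappers will be scheduled, and rebalancing should help. Also note that if the files were uploaded from a machine that is itself a DataNode, HDFS places the first replica of every block on that local node, which produces exactly this pattern.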

samthebest