
I have a folder with 3072 files, each around 50 MB. I'm running a Python script over this input using Hadoop Streaming to extract some data.

On a single file, the script takes no more than 2 seconds. However, running it over all 3072 files on an EMR cluster with 40 m1.large task nodes takes 12 minutes.

Hadoop Streaming logs this:

14/11/11 09:58:51 INFO mapred.FileInputFormat: Total input paths to process : 3072
14/11/11 09:58:52 INFO mapreduce.JobSubmitter: number of splits:3072

And hence 3072 map tasks are created.

Of course MapReduce overhead comes into play. From some initial research, it seems map tasks become very inefficient when they take less than 30-40 seconds.

What can I do to reduce the number of map tasks here? Ideally, if each task handled around 10-20 files, it would greatly reduce the overhead.

I've tried playing around with the block size, but since the files are all around 50 MB, they're already in separate blocks and increasing the block size makes no difference.
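
For reference, the job is launched with a plain Hadoop Streaming command along these lines (the jar location, bucket names and script name below are placeholders, not the actual ones):

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input s3://my-bucket/input/ \
    -output s3://my-bucket/output/ \
    -mapper "python extract.py" \
    -file extract.py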

user1265125

2 Answers


Unfortunately you can't. The number of map tasks for a given job is driven by the number of input splits: for each input split, a map task is spawned, so over the lifetime of a MapReduce job the number of map tasks equals the number of input splits.
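
To make that concrete, this is standard Hadoop behaviour rather than anything specific to this job, and the path below is a placeholder:

# With the stock FileInputFormat, the split size is roughly
#   max(minSplitSize, min(maxSplitSize, blockSize))
# and a split never spans more than one file, so the file count is a
# lower bound on the number of map tasks. Quick sanity check:
hadoop fs -ls /path/to/input | grep -c '^-'
# -> 3072 here, matching "number of splits:3072" in the job log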

SMA

mapred.min.split.size specifies the minimum split size to be processed by a mapper.

So increasing the split size should reduce the number of mappers.

Check out this link: Behavior of the parameter "mapred.min.split.size" in HDFS
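
In a streaming job the property is passed as a generic -D option before the streaming-specific options. Below is a minimal sketch, assuming placeholder paths and a Hadoop version that ships org.apache.hadoop.mapred.lib.CombineTextInputFormat; since the plain input format never merges files, packing several whole files into one split usually needs a combine-style input format as well:

# Sketch only: jar location, bucket names and script name are placeholders.
# mapred.min.split.size is the older name used above; its newer equivalent
# is mapreduce.input.fileinputformat.split.minsize.
# CombineTextInputFormat packs several whole files into one split, capped
# by mapreduce.input.fileinputformat.split.maxsize (1 GB below, so ~20 of
# the ~50 MB files per split, i.e. roughly 150-160 map tasks).
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D mapred.min.split.size=536870912 \
    -D mapreduce.input.fileinputformat.split.maxsize=1073741824 \
    -inputformat org.apache.hadoop.mapred.lib.CombineTextInputFormat \
    -input s3://my-bucket/input/ \
    -output s3://my-bucket/output/ \
    -mapper "python extract.py" \
    -file extract.py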

Vijay Innamuri