I have a small Hadoop (2.5.1) cluster with the following configuration (concerning memory limits).

mapred-site.xml:

    <property>
            <name>mapreduce.map.memory.mb</name>
            <value>3072</value>
    </property>
    <property>
            <name>mapreduce.reduce.memory.mb</name>
            <value>2048</value>
    </property>
    <property>
            <name>mapreduce.map.java.opts</name>
            <value>-Xmx2450m</value>
    </property>
    <property>
            <name>mapreduce.reduce.java.opts</name>
            <value>-Xmx1630m</value>
    </property>

yarn-site.xml:

    <property>
            <name>yarn.nodemanager.resource.memory-mb</name>
            <value>13312</value>
    </property>

I run a map-only streaming job with Python (no reducer) where I just read lines from a file and print out selected fields (I keep one of the fields as the key and join the rest into one big string).
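
A rough sketch of what the mapper does (the tab delimiter and field positions here are placeholders, not my exact script):

    #!/usr/bin/env python
    # Streaming mapper sketch: read records from stdin, write "key<TAB>value" to stdout.
    # Using field 0 as the key and joining the remaining fields is a placeholder choice.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if not fields:
            continue
        key = fields[0]                  # keep one field as the key
        value = " ".join(fields[1:])     # the rest as one big string
        print("%s\t%s" % (key, value))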

Each line holds quite a big array, so the default Hadoop configuration was changed to the one above (only to make sure that each record would fit in a mapper, so I can test my code without worrying about memory). Each line/record, though, is smaller than the block size (which I have left at its default value).

My problem is that when I test my code on a 7 GB sample of the original file everything runs perfectly, BUT when I try it on the original file (~100 GB), at about 50% of the map stage I get a "Container is running beyond physical memory limits" error, which reports that the container has gone over the 3 GB limit.

Why does a mapper need more memory for a larger file? Isn't the computation supposed to happen record by record? If the block size is smaller (by a lot) than the available memory, how does a mapper end up using more than 3 GB?

I find this issue a little perplexing.

user1676389

1 Answer

If I'm interpreting your scenario correctly, it isn't that a single mapper is exhausting your memory; it's more likely that many more mappers are being spawned in parallel, since there are so many more blocks of input - this is where much of Hadoop's parallelism comes from. The memory error is probably from too many mappers trying to run at the same time on each node. If you have a small cluster, you probably need to keep the mappers-per-node ratio lower for larger input sets.
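
As a back-of-the-envelope sketch (assuming the Hadoop 2.x default 128 MB block size and the memory settings you posted), a larger input mainly changes how many map tasks get scheduled, not how much memory each one is allowed:

    # Rough numbers only; assumes the default 128 MB block size of Hadoop 2.x
    # and the memory settings quoted in the question.
    node_memory_mb = 13312        # yarn.nodemanager.resource.memory-mb
    map_container_mb = 3072       # mapreduce.map.memory.mb
    block_size_mb = 128           # default dfs.blocksize (assumption)

    concurrent_maps_per_node = node_memory_mb // map_container_mb    # ~4 at a time
    maps_for_7gb_sample = (7 * 1024) // block_size_mb                # ~56 map tasks
    maps_for_100gb_file = (100 * 1024) // block_size_mb              # ~800 map tasks

    print(concurrent_maps_per_node, maps_for_7gb_sample, maps_for_100gb_file)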

This SO question/answer has more details about how to affect the mapper count: Setting the number of map tasks and reduce tasks

rchang
  • OK, I can decrease the number of mappers by increasing the input size per mapper (because setting the number directly does nothing, in my experience). But how much memory should I assume each mapper needs? For example, with my configuration, what would be an appropriate input size for a map task? – user1676389 Nov 27 '14 at 13:55
  • The memory usage profile for each mapper largely depends on what it's doing - the number and size of intermediary objects that have to be constructed before the mapper's output is emitted, for example. If there is no substantial way to optimize the mapper implementation, there may still be other options, like adding more nodes to the cluster (to spread the same number of maps over more nodes) or increasing the block size (as long as your memory consumption isn't proportional to block size). – rchang Nov 27 '14 at 19:15
  • That does not explain why I get an error on a larger version of the same file but not on a sample. Moreover, if I increase the block size, that will decrease the number of splits, but it has the same effect as increasing the input size of each mapper. I tried increasing both the block size and the input size, but that makes the mappers fail faster on the larger file (because they reach the memory limits I've set). I think the original answer is closer to the solution, but I still don't understand what the relationship is between the block size and how much memory each mapper needs. – user1676389 Nov 28 '14 at 19:10
  • My understanding is that a larger input file means more blocks in the input set (for a fixed block size), which means more mappers will be spawned. – rchang Nov 28 '14 at 20:51