
I am relatively new to the Hadoop world. I have been following the examples I could find to understand how the record-splitting step works for MapReduce jobs. I noticed that TextInputFormat splits a file into records, with the byte offset as the key and the line of text as the value. In that case, two different records from different input files could reach a mapper with the same offset.

Does this affect the mapper in any way? I think the uniqueness of the key is irrelevant to the mapper if we do not process it (e.g. word count). But if we have to process the key in the mapper, it may have to be unique. Can anyone elaborate on this?

Thanks in advance.

Santanu C
  • You may be interested in this post: http://stackoverflow.com/questions/18642875/extend-sequencefileinputformat-to-include-file-nameoffset – frb Apr 29 '15 at 20:52
  • @frb : Thanks for sharing the link. It is an interesting use case, but it still does not answer my question: could there be a scenario where we need a unique key for each record that goes to a mapper? In that post, the user is not processing the key, so uniqueness does not come into the picture. – Santanu C Apr 29 '15 at 21:02

3 Answers


The input to a mapper is an input split of the file (often one HDFS block), not a bare key-value pair. The framework's RecordReader turns that split into key-value pairs for the map() calls, so the mapper itself is not impacted by duplicate keys.

The "final" output generated by a Mapper is a multivalued hashmap.

<Key, List<Value>>

This grouped output becomes the input to the reducer, and all values for a given key are processed by the same reducer. It is fine for mappers to emit more than one value for a key; in fact, some solutions depend on this behavior.
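To make this concrete, here is a minimal sketch of a word-count style mapper (hypothetical class name) against the standard org.apache.hadoop.mapreduce API: the byte-offset key is ignored entirely, and the same output key is emitted many times.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Word-count style mapper: the byte-offset key is never read, so
    // duplicate offsets across different input files are irrelevant.
    public class TokenCounterMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    // Emitting the same key many times is fine; the framework
                    // groups all values for a key before the reduce phase.
                    context.write(word, ONE);
                }
            }
        }
    }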

aasoj

The answer to your question depends entirely on the scenario.

If you are not using the key (the byte offset supplied by TextInputFormat is rarely used, though if you use KeyValueTextInputFormat you probably are using it), then duplicate keys have no impact at all. But if your map() logic performs calculations based on the key, then duplicates will definitely matter.

So it depends entirely on the scenario.
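As a sketch of the second case (hypothetical class name, and assuming plain-text input where each file starts with a header line): here the byte offset is actually used, and the fact that every input file produces an offset-0 record is exactly what the logic relies on.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // A mapper that does use the byte-offset key: offset 0 marks the first
    // line of a file, which is assumed here to be a CSV-style header.
    public class SkipHeaderMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Every input file yields an offset-0 record, so this check
            // fires once per file rather than once per job.
            if (offset.get() == 0) {
                return; // skip the (assumed) header line
            }
            context.write(line, NullWritable.get());
        }
    }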

Mudit Singh

There is a misunderstanding here. Actually:

One mapper is assigned to each input split of the file, and all records from a single input split are processed by only one mapper for a given job.

There is no need to worry about records with duplicate keys arriving at a mapper, since a mapper's execution scope is always a single key/value pair at any point in time.

The output of a map task, some n key/value pairs, is eventually merged, sorted, and partitioned based on the keys.

Each reducer then collects the output destined for its partition from all the mappers and arranges the key/value pairs as <key, Iterable<value>>.
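As an illustrative sketch (hypothetical class name, assuming IntWritable counts as in a word-count job), the reduce() signature makes that grouping visible:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Each reduce() call receives one key together with an Iterable over
    // every value emitted for that key by any mapper in the job.
    public class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }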

suresiva