
The input file I need to process has data organized as headers, each followed by its respective records. My 200 MB file has 3 such headers, and their records are split across 4 blocks (3 × 64 MB and 1 × 8 MB).

The data is in the following format:

HEADER 1
Record 1
Record 2
.
.
Record n
HEADER 2
Record 1
Record 2
.
.
Record n
HEADER 3
Record 1
Record 2
.
.
Record n

All I need is to take the HEADER as a key and the Records below it as values, and perform some operations on them in my mapper code.

The problem here is that my records are split across different blocks. Suppose my first header and its respective records occupy 70 MB; that means they take up the 64 MB of the first block and 6 MB of the second block.

Now, how does the mapper that runs on the 2nd block know that those 6 MB of the file belong to the records of HEADER 1?

Can anyone please explain how to get each header and its records completely?

Vamsinag R
  • I have tried to answer something similar here: http://stackoverflow.com/questions/32758322/how-do-we-count-the-number-of-times-a-map-function-is-called-in-a-mapreduce-prog – YoungHobbit Sep 28 '15 at 12:41
  • @YoungHobbit I think the issue here is the unknown number of records after each header, and also the header and its records getting combined. – Ramzy Sep 29 '15 at 16:47

3 Answers


You need a custom RecordReader and a custom LineReader to process the file this way, rather than reading it line by line.

Since the splits are calculated in the client, every mapper already knows whether or not it needs to discard the records belonging to the previous header.

The link below might be helpful: How does Hadoop process records split across block boundaries?
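
For illustration only, here is a rough, untested sketch of that idea; the class names HeaderInputFormat / HeaderRecordReader and the assumption that header lines can be recognized by a "HEADER" prefix are mine, not from the question. The reader emits one (header, all of its records) pair per group, keeps reading past the split end to finish the group it started, and skips leading lines that belong to a header in the previous split:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.util.LineReader;

    public class HeaderInputFormat extends FileInputFormat<Text, Text> {
        @Override
        public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
            return new HeaderRecordReader();
        }

        public static class HeaderRecordReader extends RecordReader<Text, Text> {
            private FSDataInputStream in;
            private LineReader reader;
            private long start, end, pos;
            private final Text key = new Text(), value = new Text(), line = new Text();
            private String pendingHeader;            // header of the next group, read ahead of time
            private boolean done;

            private static boolean isHeader(String s) {
                return s.startsWith("HEADER");       // assumption: header lines start with "HEADER"
            }

            @Override
            public void initialize(InputSplit genericSplit, TaskAttemptContext ctx) throws IOException {
                FileSplit split = (FileSplit) genericSplit;
                Configuration conf = ctx.getConfiguration();
                start = split.getStart();
                end = start + split.getLength();
                Path file = split.getPath();
                FileSystem fs = file.getFileSystem(conf);
                in = fs.open(file);
                in.seek(start);
                reader = new LineReader(in, conf);
                pos = start;
                if (start == 0) return;              // first split: the file begins with a header
                pos += reader.readLine(line);        // discard a possibly partial first line
                // Leading records belong to a header from the previous split, whose mapper
                // reads past its own end; skip until the first header starting inside this split.
                while (pos <= end) {
                    int read = reader.readLine(line);
                    if (read == 0) break;            // end of file
                    pos += read;
                    if (isHeader(line.toString())) { pendingHeader = line.toString(); return; }
                }
                done = true;                         // no header starts inside this split
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (done) return false;
                if (pendingHeader == null) {         // first call on the first split
                    int read = reader.readLine(line);
                    if (read == 0 || !isHeader(line.toString())) { done = true; return false; }
                    pos += read;
                    pendingHeader = line.toString();
                }
                key.set(pendingHeader);
                pendingHeader = null;
                StringBuilder records = new StringBuilder();
                while (true) {
                    long lineStart = pos;
                    int read = reader.readLine(line);
                    if (read == 0) { done = true; break; }   // end of file closes the last group
                    pos += read;
                    if (isHeader(line.toString())) {
                        // A header that starts inside this split is ours to read next;
                        // one that starts beyond the split end belongs to the next mapper.
                        if (lineStart <= end) pendingHeader = line.toString(); else done = true;
                        break;
                    }
                    if (records.length() > 0) records.append('\n');
                    records.append(line.toString());
                }
                value.set(records.toString());
                return true;
            }

            @Override public Text getCurrentKey() { return key; }
            @Override public Text getCurrentValue() { return value; }
            @Override public float getProgress() {
                return done ? 1.0f : Math.min(1.0f, (pos - start) / (float) Math.max(1L, end - start));
            }
            @Override public void close() throws IOException { if (in != null) in.close(); }
        }
    }

The important part is the ownership rule: a group belongs to the split in which its header line starts, in the same way TextInputFormat treats a line that straddles a block boundary.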

Abhinay

You have two options:

  1. A single mapper handles all the records, so you have the complete data in a single class and can decide yourself how to separate it. Given the input size, this will have performance issues, but it takes less coding effort, and if each mapper handles little data and runs frequently, this approach is fine (a minimal sketch of how to prevent splitting is shown after this list). More info in the Hadoop Definitive Guide, under MapReduce Types and Formats / Input Formats / Preventing Splitting.

  2. If you plan to use a custom split and record reader, you are modifying the way the framework works, because your records are otherwise similar to what TextInputFormat handles. So in most cases you do not need to plan for a custom record reader; however, you do need to define how the splits are made. In general, splits are made roughly equal to the block size to take advantage of data locality. In your case, your data (mainly the header part) can end in any block, and you should split accordingly. All of the above changes are needed to make MapReduce work with the data you have.
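
A minimal sketch of option 1, assuming your records otherwise fit TextInputFormat (the class name NonSplittableTextInputFormat is made up): overriding isSplitable sends the whole file to a single mapper, so one map task sees every header together with all of its records.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Never split the input file: the whole file becomes one split, and one mapper.
    public class NonSplittableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }
    }

You would then register it in the driver with job.setInputFormatClass(NonSplittableTextInputFormat.class).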

Ramzy

You can increase the default HDFS block size to 128 MB, and if the file is small enough, it will be stored as a single block.
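
Note that the block size only applies to files written after it is changed, so it has to be set (on the writing client) before the input is uploaded. A small illustrative sketch, with made-up paths, that sets dfs.blocksize to 128 MB and then copies the input into HDFS:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UploadWithLargerBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // 128 MB blocks for files created through this client
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
            FileSystem fs = FileSystem.get(conf);
            // The new file is written with the block size configured above
            fs.copyFromLocalFile(new Path("/local/input.txt"), new Path("/user/hadoop/input.txt"));
        }
    }

The same property can also be passed on the command line, e.g. hdfs dfs -D dfs.blocksize=134217728 -put input.txt /user/hadoop/.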

Ajay Gupta