The input file I need to process contains data grouped under headers, each followed by its records. The 200 MB file has 3 such headers, and its contents are split across 4 blocks (3*64 MB and 1*8 MB).
The data is in the following format:
HEADER 1
Record 1
Record 2
.
.
Record n
HEADER 2
Record 1
Record 2
.
.
Record n
HEADER 3
Record 1
Record 2
.
.
Record n
All I need is to use each HEADER as a key and the Records below it as values, and then perform some operations on them in my mapper code.
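To make the goal concrete, here is a minimal sketch in plain Java (no Hadoop involved) of the grouping I am after. The class name `HeaderGrouping` and the rule that a header is any line starting with the text `HEADER` are assumptions made only for this illustration; my real headers may need a different test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: groups record lines under the most recent header line.
// Assumption: a header is any line that starts with "HEADER".
class HeaderGrouping {

    static Map<String, List<String>> group(List<String> lines) {
        // LinkedHashMap keeps the headers in the order they appear in the file.
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        String currentHeader = null;
        for (String line : lines) {
            if (line.startsWith("HEADER")) {
                currentHeader = line;
                grouped.put(currentHeader, new ArrayList<>());
            } else if (currentHeader != null) {
                grouped.get(currentHeader).add(line);
            }
            // Records seen before any header are skipped here, because
            // there is no key to attach them to.
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "HEADER 1", "Record 1", "Record 2",
                "HEADER 2", "Record 1");
        // prints {HEADER 1=[Record 1, Record 2], HEADER 2=[Record 1]}
        System.out.println(group(lines));
    }
}
```

This works when one process reads the whole file from the top. It does not by itself handle the case where reading starts in the middle of a section, which is the situation described next.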
The problem is that my Records are split across different blocks. Suppose the first Header and its Records occupy 70 MB: that means 64 MB sits in the first block and the remaining 6 MB sits in the 2nd block.
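The arithmetic behind that example can be sketched as follows. This is plain Java with no Hadoop calls; the class `BlockSpan` and its method are hypothetical names used only to show how a 70 MB section starting at offset 0 overlaps 64 MB blocks.

```java
// Illustration only: how many bytes of a section fall into each fixed-size block.
class BlockSpan {

    static final long MB = 1024L * 1024L;

    // Bytes of the range [start, end) that lie inside the block with the given index.
    static long bytesInBlock(long start, long end, long blockIndex, long blockSize) {
        long blockStart = blockIndex * blockSize;
        long blockEnd = blockStart + blockSize;
        long overlap = Math.min(end, blockEnd) - Math.max(start, blockStart);
        return Math.max(0L, overlap);
    }

    public static void main(String[] args) {
        long blockSize = 64 * MB;
        long sectionStart = 0;        // HEADER 1 starts at the top of the file
        long sectionEnd = 70 * MB;    // assumed size of HEADER 1 plus its records
        for (long block = 0; block < 4; block++) {
            long bytes = bytesInBlock(sectionStart, sectionEnd, block, blockSize);
            System.out.println("block " + (block + 1) + ": " + (bytes / MB) + " MB");
        }
        // prints 64 MB for block 1, 6 MB for block 2, and 0 MB for blocks 3 and 4
    }
}
```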
Now, how does the mapper that runs on the 2nd block know that the first 6 MB of that block belong to the records of HEADER 1?
Can anyone please explain how to get each Header together with all of its records?