3

I have a 100 TB text file containing multiline records. We are not told how many lines each record takes: one record may span 5 lines, another 6, another 4. The number of lines can vary from record to record.

So I cannot use the default TextInputFormat. I have written my own InputFormat and a custom RecordReader, but here is my confusion: when the input is split, I am not sure that each split will contain only whole records. Part of a record could end up in split 1 and the rest in split 2, which would be wrong.

So, can you suggest how to handle this scenario so that each full record is guaranteed to be read within a single InputSplit?

Thanks in advance -JE

java_enthu

2 Answers

3

You need to know if the records are actually delimited by some known sequence of characters.

If you know this you can set the textinputformat.record.delimiter config parameter to separate the records.

If the records aren't character delimited, you'll need some extra logic that, for example, counts a known number of fields (if there are a known number of fields) and presents that as a record. This usually makes things more complex, prone to error and slow as there's another lot of text processing going on.

Try determining if the records are delimited. Perhaps posting a short example of a few records would help.

Intermernet
1

In your record reader you need to define an algorithm by which you can:

  • Determine whether you're in the middle of a record
  • Scan past that partial record and read the next full record

This is similar to what the LineRecordReader used by TextInputFormat already does. When the input split starts at a non-zero offset, the reader scans forward from that offset to the first newline it finds and emits the record that follows that newline as its first record. Correspondingly, if the split ends before the end of the file, the reader reads up to and past the end of the split to find the line terminator of the current record.
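The same convention can be sketched for multiline records in plain Java, without any Hadoop classes. This is only an illustration under stated assumptions: the records are separated by a known delimiter (the "---\n" marker here is made up for the example), the delimiter never overlaps with itself, and offsets are character indices into an in-memory string rather than byte offsets into a file. The names SplitAwareReader, readSplit and readAllSplits are hypothetical; in a real job this logic would presumably live in the custom RecordReader's initialize and nextKeyValue methods.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch (no Hadoop dependency). All names here are hypothetical.
class SplitAwareReader {

    // Returns the records owned by the split [start, end).
    // A record is owned by the split in which it begins; a record that
    // begins exactly at `end` is still owned by this split, mirroring the
    // "read one record past the boundary" convention described above.
    static List<String> readSplit(String data, String delim, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos;
        if (start == 0) {
            pos = 0; // the first split starts on a record boundary
        } else {
            // We may be mid-record: skip to just after the first delimiter
            // that ends beyond `start`. Searching from start - len + 1 also
            // catches a delimiter that straddles the split boundary.
            int d = data.indexOf(delim, Math.max(0, start - delim.length() + 1));
            if (d < 0) {
                return records; // no record begins in this split
            }
            pos = d + delim.length();
        }
        // Emit every record that begins at or before `end`, reading past
        // `end` if needed to reach the record's closing delimiter.
        while (pos <= end && pos < data.length()) {
            int d = data.indexOf(delim, pos);
            int recordEnd = (d < 0) ? data.length() : d;
            records.add(data.substring(pos, recordEnd));
            pos = (d < 0) ? data.length() : d + delim.length();
        }
        return records;
    }

    // Cuts `data` into fixed-size splits and concatenates what each split's
    // reader emits, to show that no record is lost, cut, or duplicated.
    static List<String> readAllSplits(String data, String delim, int splitSize) {
        List<String> all = new ArrayList<>();
        for (int start = 0; start < data.length(); start += splitSize) {
            int end = Math.min(start + splitSize, data.length());
            all.addAll(readSplit(data, delim, start, end));
        }
        return all;
    }

    public static void main(String[] args) {
        String data = "a1\na2\n---\nb1\n---\nc1\nc2\nc3\n---\nd1\n";
        // A split size of 7 deliberately cuts through records and delimiters.
        for (String record : readAllSplits(data, "---\n", 7)) {
            System.out.println("[" + record.trim().replace("\n", " | ") + "]");
        }
    }
}
```

The point of the design is that the splits themselves stay arbitrary byte ranges; the guarantee comes from each reader agreeing on which split owns a record, not from the split boundaries lining up with record boundaries.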

Chris White