
I have a log file like the one below:

Begin ... 12-07-2008 02:00:05         ----> record1
incidentID: inc001
description: blah blah blah 
owner: abc 
status: resolved 
end .... 13-07-2008 02:00:05 
Begin ... 12-07-2008 03:00:05         ----> record2 
incidentID: inc002 
description: blah blah blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
owner: abc 
status: resolved 
end .... 13-07-2008 03:00:05

I want to use MapReduce to process this, and I want to extract the incident ID, the status, and the time taken for each incident.

How do I handle both records, given that they have variable lengths? And what happens if the input split occurs before a record ends?
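For reference, once a mapper sees one whole record, extracting the fields and the elapsed time is straightforward. Below is a plain-Java sketch based on the log sample above; the class and method names are purely illustrative, not from any API:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical helper: parses one complete record (the lines from
// "Begin ..." through "end ...") and reports incidentID, status and
// the time taken in seconds.
public class IncidentParser {
    // Timestamp layout used in the log sample: 12-07-2008 02:00:05
    private static final SimpleDateFormat FMT =
            new SimpleDateFormat("dd-MM-yyyy HH:mm:ss");

    // Returns "incidentID,status,secondsTaken" for one record's text.
    public static String parse(String record) {
        String id = null, status = null;
        Date begin = null, end = null;
        try {
            for (String line : record.split("\n")) {
                line = line.trim();
                if (line.startsWith("Begin")) {
                    // strip "Begin ... " prefix, keep the 19-char timestamp
                    begin = FMT.parse(
                        line.replaceAll("^Begin\\s*\\.+\\s*", "").substring(0, 19));
                } else if (line.startsWith("end")) {
                    end = FMT.parse(
                        line.replaceAll("^end\\s*\\.+\\s*", "").substring(0, 19));
                } else if (line.startsWith("incidentID:")) {
                    id = line.substring("incidentID:".length()).trim();
                } else if (line.startsWith("status:")) {
                    status = line.substring("status:".length()).trim();
                }
            }
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad timestamp in record", e);
        }
        long seconds = (end.getTime() - begin.getTime()) / 1000L;
        return id + "," + status + "," + seconds;
    }
}
```

The real difficulty, as the question notes, is getting each mapper a complete record in the first place.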

ghosts
2 Answers


You'll need to write your own InputFormat and RecordReader to ensure proper file splitting around your record delimiter.

Basically, your record reader will need to seek to its split's byte offset, then scan forward (reading lines) until it finds one of:

  • the Begin ... line
    • Read lines up to the next end ... line and emit the lines between the Begin and the end as the next record
  • the end of the split, or EOF — in which case there are no more records for this split

The algorithm is similar to how Mahout's XmlInputFormat handles multi-line XML as input — in fact, you might be able to amend that source code directly to handle your situation.
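The scan-and-read logic described above can be modeled on a list of lines (a real Hadoop RecordReader works with byte offsets and an input stream, but the control flow is the same; the class and method names here are illustrative, not from the Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;

// Line-based model of the split-handling algorithm: a reader only
// starts records whose "Begin" line falls inside its own split, but
// may read past the split's end to finish the last record. The reader
// for the next split skips any partial record at its start because it
// seeks forward to the first "Begin" at or after its own start.
public class SplitScanner {
    // Returns the complete records whose "Begin" line lies in
    // [splitStart, splitEnd).
    public static List<String> records(List<String> lines,
                                       int splitStart, int splitEnd) {
        List<String> out = new ArrayList<>();
        int i = splitStart;
        while (i < lines.size()) {
            // seek forward to the next record delimiter
            while (i < lines.size() && !lines.get(i).startsWith("Begin")) i++;
            if (i >= lines.size() || i >= splitEnd) break; // past split or EOF
            StringBuilder rec = new StringBuilder();
            while (i < lines.size()) {                     // read one record
                rec.append(lines.get(i)).append('\n');
                if (lines.get(i).startsWith("end")) { i++; break; }
                i++;
            }
            out.add(rec.toString());
        }
        return out;
    }
}
```

Because each record is owned by exactly one split (the one containing its Begin line), no record is lost or duplicated even when a split boundary falls mid-record.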

As mentioned in @irW's answer, NLineInputFormat is another option if your records have a fixed number of lines each, but it is really inefficient for larger files, as it has to open and read the entire file to discover the line offsets in the input format's getSplits() method.

Chris White
  • Your answer gives some insight into the first problem (a record that spans multiple lines). Could you please share your views on the other problem, viz. how to handle records broken across an input split? – Kaushik Lele Jun 07 '15 at 12:32
  • XmlInputFormat handles broken records by skipping forward until an opening record tag is found when processing a new split. It also reads past the end of the split to a valid closing record tag, so that no data is lost. – Ed Bayiates Sep 27 '18 at 21:31

In your examples, each record has the same number of lines. If that is always the case, you could use NLineInputFormat; if it is impossible to know the number of lines in advance, it might be more difficult. (More info on NLineInputFormat: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html )
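If the six-lines-per-record layout really is fixed, the driver configuration would look roughly like this. This is a hedged sketch against the newer `org.apache.hadoop.mapreduce` API, not runnable standalone; note that the mapper still receives one line per `map()` call and must accumulate the six lines of its split into one record:

```java
// Config fragment: align splits with the 6-line records in the sample.
// Assumes a org.apache.hadoop.mapreduce.Job named "job" already exists.
job.setInputFormatClass(NLineInputFormat.class);
// every split (and therefore every mapper) gets exactly one record
NLineInputFormat.setNumLinesPerSplit(job, 6);
```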

DDW
  • Here the number of lines per record is not fixed; it may vary. – ghosts Sep 06 '13 at 18:04
  • I think Chris White's answer should solve your problems. You have two options: preprocess your log file so it can be read with XmlInputFormat, or write your own InputFormat, which will be very similar to XmlInputFormat. – DDW Sep 08 '13 at 09:50