
Assume a client application that uses a FileSplit object in order to read the actual bytes from the corresponding file.

To do so, an InputStream object has to be created from the FileSplit, via code like:

    FileSplit split = ... // The FileSplit reference
    FileSystem fs   = ... // The HDFS reference

    FSDataInputStream fsin = fs.open(split.getPath());

    long start = split.getStart() - 1; // The byte just before the split's first byte

    if (start >= 0)
    {
        fsin.seek(start);
    }

The -1 adjustment of the seek position appears in some places, such as the Hadoop MapReduce LineRecordReader class. However, the documentation of the FSDataInputStream seek() method states explicitly that, after seeking to a location, the next read starts from that location, which seems to imply that the code above is off by 1 byte.
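
Indeed, a quick check like the following (just an illustration, reusing the fs and split references from the snippet above) seems to confirm that seek(p) positions the stream exactly at p, i.e. the next read returns the byte at the split's start, not the one before it:

    FSDataInputStream check = fs.open(split.getPath());
    check.seek(split.getStart());          // no -1 adjustment here
    System.out.println(check.getPos());    // prints split.getStart()
    int b = check.read();                  // returns the byte AT split.getStart()
    check.close();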

So, the question is: is that "-1" adjustment necessary in all InputSplit reading cases?

By the way, if one wants to read a FileSplit correctly, seeking to its start is not enough, because every split also has an end that may not be identical to the end of the actual HDFS file. So, the corresponding InputStream should be "bounded", i.e. have a maximum length, like the following:

    InputStream is = new BoundedInputStream(fsin, split.getLength());

In this case, after the "native" fsin stream has been created as above, the org.apache.commons.io.input.BoundedInputStream class is used to implement the "bounding".
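
Putting it together, a sketch of reading a full split this way might look as follows (just an illustration of the above, with simplified error handling, no -1 adjustment, and illustrative variable names):

    FSDataInputStream fsin = fs.open(split.getPath());
    fsin.seek(split.getStart());                      // position at the split start
    InputStream is = new BoundedInputStream(fsin, split.getLength());

    byte[] buffer = new byte[4096];
    int n;
    while ((n = is.read(buffer)) != -1) {
        // process buffer[0 .. n-1]
    }
    is.close();  // by default BoundedInputStream also closes the wrapped fsin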

UPDATE

Apparently the adjustment is necessary only for use cases like that of the LineRecordReader class, which reads beyond the boundaries of a split to make sure that it reads the full last line.

A good discussion with more details on this can be found in an earlier question and in the comments for MAPREDUCE-772.

PNS

1 Answer


Seeking to position 0 will mean the next call to InputStream.read() will read byte 0. Seeking to position -1 will most probably throw an exception.
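
For instance, something along these lines (assuming the fs and split references from your question; the exact exception, if any, for the negative seek depends on the underlying FileSystem implementation):

    FSDataInputStream in = fs.open(split.getPath());
    in.seek(0);
    int firstByte = in.read();              // reads the byte at offset 0

    try {
        in.seek(-1);                        // invalid offset
    } catch (IOException e) {
        // HDFS streams typically reject negative offsets (e.g. with an EOFException)
        System.err.println("negative seek rejected: " + e);
    } finally {
        in.close();
    }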

Where specifically are you seeing this standard pattern - in which examples and source code?

Splits are not necessarily bounded as you note - take TextInputFormat, for example, with files that can be split. The record reader that processes the split will:

  • Seek to the start index, then find the next newline character
  • Find the next newline character (or EOF) and return that 'line' as the next record

This repeats until either the next newline found is past the end of the split, or EOF is reached. So you can see that in this case the actual bounds of a split might be right-shifted from those given by the InputSplit.
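
A condensed sketch of that read loop (illustrative only, not the actual Hadoop code; here 'in' is assumed to be a LineReader over the already-seeked stream and start/end are the split bounds, as in the code quoted below):

    // A record is read whenever its START position is inside the split, so the
    // last record may run past (split.getStart() + split.getLength()).
    long pos = start;                       // position after any first-line handling
    Text value = new Text();
    while (pos < end) {
        int bytesRead = in.readLine(value); // may well read past 'end'
        if (bytesRead == 0) {
            break;                          // EOF
        }
        pos += bytesRead;
        // emit 'value' as the next record
    }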

Update

Referencing this code block from LineRecordReader:

    if (codec != null) {
      in = new LineReader(codec.createInputStream(fileIn), job);
      end = Long.MAX_VALUE;
    } else {
      if (start != 0) {
        skipFirstLine = true;
        --start;
        fileIn.seek(start);
      }
      in = new LineReader(fileIn, job);
    }
    if (skipFirstLine) {  // skip first line and re-establish "start".
      start += in.readLine(new Text(), 0,
                           (int)Math.min((long)Integer.MAX_VALUE, end - start));
    }

The --start statement is most probably there to avoid the case where the split starts on a newline character and an empty line would be returned as the first record. You can see that, if the seek occurs, the first line is skipped to ensure the file splits don't return overlapping records.
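
To illustrate with a toy example (mine, not from Hadoop - it just drives the same org.apache.hadoop.util.LineReader class over an in-memory buffer, assuming hadoop-common is on the classpath):

    // Split B is assumed to start exactly at byte 4 ("bbb"), i.e. right after a newline.
    byte[] data = "aaa\nbbb\nccc\n".getBytes(StandardCharsets.UTF_8);
    long splitStart = 4;

    ByteArrayInputStream in = new ByteArrayInputStream(data);
    long start = splitStart - 1;            // the --start adjustment
    in.skip(start);                         // stand-in for fileIn.seek(start)

    LineReader reader = new LineReader(in, new Configuration());
    Text skipped = new Text();
    start += reader.readLine(skipped);      // consumes only the '\n' at byte 3
    // skipped is now "" and start is back at 4, so the first real record of
    // split B ("bbb") is neither lost nor duplicated by the previous split.

    Text first = new Text();
    reader.readLine(first);                 // first = "bbb"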

Chris White
  • The code above will not seek to -1, but to start-1, i.e. to the byte just before the start of every split after the 1st. The LineRecordReader used by TextInputFormat of course works as you said, since it always reads at least 1 more byte beyond the split and continues until the line end. Maybe the -1 is to account for that. – PNS Apr 24 '13 at 00:09
  • Can you point me towards source / examples that exhibit this -1 seeking behaviour? – Chris White Apr 24 '13 at 00:25
  • Line 59 of http://javasourcecode.org/html/open-source/hadoop/hadoop-1.0.3/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java.html doesn't seek to Start - 1... – Chris White Apr 24 '13 at 00:27
  • There are numerous examples online, but they are associated with the LineReader, so I was wondering whether anything special happens in general. Check, for instance, http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/. – PNS Apr 24 '13 at 04:25
  • Sorry, my own example cites what you're asking about! Updated the answer accordingly. – Chris White Apr 24 '13 at 11:31
  • Exactly, which is why I was wondering. If you look at the input splits produced by the NLineInputFormat, which uses the LineRecordReader, they all start with a newline, apart from the first (handled by the --start statement you also quoted) and the last (if it is smaller than a "full" split). The latter case is not properly handled by the LineRecordReader. – PNS Apr 25 '13 at 07:59
  • NLineInputFormat has a comment in the `getSplitsForFile` method which details the reason: 'NLineInputFormat uses LineRecordReader, which always reads (and consumes) at least one character out of its upper split boundary. So to make sure that each mapper gets N lines, we move back the upper split limits of each split by one character here.' – Chris White Apr 25 '13 at 11:07
  • The --start part of LineRecordReader is for skipping a partial line at the start of a split (B), in case the last line of the previous split (A) crosses the split boundary. The reader reads the last byte of split A and, if that is a newline (i.e., A ended on a complete line), it will "skip" that newline character and so will start reading from the first line of split B. Otherwise, the last line of A extends over to B, so the first line of B will be consumed and readLine() will start from the second line. – PNS Apr 25 '13 at 18:31
  • What NLineInputFormat does by putting the newline of the last line of a split at the start of the next split (excluding the last) is not necessary for the LineRecordReader to work. If the newline were left in the previous split, the result would still be the same. – PNS Apr 25 '13 at 18:36