I want to read sub-content of a big file starting from some offset/position. For example, I have a file of 1M lines and I want to skip the first 100 lines and read the next 50 (line numbers 101 to 150, both inclusive).
I think I should be using PositionedReadable: https://issues.apache.org/jira/browse/HADOOP-519
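For example, I imagine using it roughly like this (a minimal sketch; the file path and byte offset are made up for illustration):

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PositionedReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/data/bigfile.txt"); // made-up path

        byte[] buffer = new byte[4096];
        try (FSDataInputStream in = fs.open(path)) {
            // PositionedReadable: read buffer.length bytes starting at byte
            // offset 1_000_000, without moving the stream's current position.
            in.readFully(1_000_000L, buffer, 0, buffer.length);
        }
        System.out.println(new String(buffer, StandardCharsets.UTF_8));
    }
}
```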
I see that FSInputStream.readFully actually uses the seek() method of Seekable.
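As far as I can tell, the default implementation has roughly this shape (my own paraphrase of what I read, not the actual Hadoop source):

```java
import java.io.IOException;

// My paraphrase of how readFully(position, ...) ends up calling seek():
// remember the current position, seek to the target, read, seek back.
abstract class SeekBasedReadFully {
    abstract long getPos() throws IOException;
    abstract void seek(long pos) throws IOException;
    abstract void readFully(byte[] buffer, int offset, int length) throws IOException;

    public void readFully(long position, byte[] buffer, int offset, int length)
            throws IOException {
        synchronized (this) {
            long oldPos = getPos();
            try {
                seek(position);                     // jump to the requested offset
                readFully(buffer, offset, length);  // sequential read from there
            } finally {
                seek(oldPos);                       // restore the original position
            }
        }
    }
}
```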
When I check the underlying implementation of seek(), I see that it uses BlockReader.skip().
Wouldn't BlockReader.skip() read all of the data up to the target position just to skip those bytes? The question is: would HDFS load the first 100 lines as well in order to get to the 101st line?
How can I position the stream at any desired offset in the file, such as the 10000th line, without loading the preceding content? Something like what S3 offers with byte-range headers.
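To make it concrete, this is the kind of thing I want to write, assuming I somehow knew the byte offset where line 10000 starts (lineStartOffset below is hypothetical, since lines are variable-length and I don't know how to get that offset without scanning the file or keeping a separate index, which is exactly my question):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromOffset {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/data/bigfile.txt"); // made-up path

        // Hypothetical byte offset where line 10000 starts.
        long lineStartOffset = 123_456_789L;

        try (FSDataInputStream in = fs.open(path)) {
            in.seek(lineStartOffset); // want this NOT to read the preceding bytes
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            for (int i = 0; i < 50; i++) { // read the next 50 lines
                String line = reader.readLine();
                if (line == null) {
                    break; // reached end of file early
                }
                System.out.println(line);
            }
        }
    }
}
```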
Here is a similar question I found: How to read files with an offset from Hadoop using Java. It suggests using seek(), but the comments argue that seek() is an expensive operation and should be used sparingly. I guess that is correct, because seek seems to read all the data just to skip to the target position.