
Problem: I want to read a section of a file from HDFS and return it, such as lines 101-120 from a file of 1000 lines.

I don't want to use seek because I have read that it is expensive.

I have log files which I am processing down into meaningful sets of data with Pig. I've been writing an API to return the data for consumption and display by a front end. Those processed data sets can be large enough (say, 5-10 MB) that I don't want to read the entire file out of Hadoop in one slurp, to save wire time and bandwidth.

Currently I am using a BufferedReader to return small summary files, which is working fine:

ArrayList<String[]> lines = new ArrayList<>();
...
for (FileStatus item : items) {
    // ignore bookkeeping files like _SUCCESS
    if (item.getPath().getName().startsWith("_")) {
        continue;
    }

    in = fs.open(item.getPath());
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
    try {
        String line;
        while ((line = br.readLine()) != null) {
            // readLine() already strips \r and \n, so each line just needs splitting
            lines.add(line.split("\t"));
        }
    } finally {
        br.close();
    }
}
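
For reference, the most direct tweak to that loop is to skip ahead to the start line and stop after the end line. This still streams the file from the beginning and throws the unwanted lines away, which is exactly what I'd like to avoid for the larger files; startLine and endLine are just illustrative:

int startLine = 101;   // 1-based, inclusive
int endLine = 120;     // inclusive

in = fs.open(item.getPath());
BufferedReader br = new BufferedReader(new InputStreamReader(in));
try {
    String line;
    int lineNo = 0;
    while ((line = br.readLine()) != null) {
        lineNo++;
        if (lineNo < startLine) {
            continue;          // not in the window yet
        }
        lines.add(line.split("\t"));
        if (lineNo >= endLine) {
            break;             // whole window read, stop pulling bytes off the wire
        }
    }
} finally {
    br.close();
}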

I've poked around the interwebs quite a bit as well as Stack but haven't found exactly what I need.

Perhaps this is completely the wrong way to go about doing it and I need a completely separate set of code and different functions to manage this. Open to any suggestions.

Thanks!

As an added note based on research from the discussions below: How does Hadoop process records split across block boundaries? and Hadoop FileSplit Reading

  • If I understand you well, your code is working but you would like to optimize it in order to read only some parts of the file you're reading? – merours Jun 05 '14 at 15:33
  • Do you know the byte offset to the start of the line you want to seek to? Seek is less expensive than scanning line-by-line and throwing away the ones you don't want. – Mike Park Jun 05 '14 at 19:43
  • A good question. I have started to think about how I would figure this out. Probably someone else has already done so and I need to poke around the net. – dbg Jun 05 '14 at 20:42
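
Building on the comment above about seeking to a known byte offset: a rough sketch of what that looks like with the plain FileSystem API. The path, offset, and line count here are made up for illustration, and fs is the same FileSystem handle as in the code above.

FSDataInputStream in = fs.open(new Path("/user/logs/part-r-00000")); // hypothetical file
in.seek(1048576L);   // hypothetical byte offset; one seek is cheap compared to scanning every line

BufferedReader br = new BufferedReader(new InputStreamReader(in));
try {
    // The offset may land mid-line, so throw away the (possibly partial) first line.
    br.readLine();

    ArrayList<String[]> window = new ArrayList<>();
    String line;
    for (int i = 0; i < 20 && (line = br.readLine()) != null; i++) {
        window.add(line.split("\t"));
    }
} finally {
    br.close();
}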

1 Answer


I think seek is the best option for reading files with huge volumes. It did not cause any problems for me, and the volume of data I was reading was in the range of 2-3 GB. I have not run into any issues so far, but we did use file splitting to handle the large data set. Below is code you can use for reading and test for yourself.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.json.JSONObject;

public class HDFSClientTesting {

    public static void main(String[] args) {
        SequenceFile.Reader rdr = null;
        try {
            // Load the cluster configuration before creating the FileSystem handle,
            // otherwise core-site.xml has no effect on the FileSystem we get back.
            Configuration conf = new Configuration();
            conf.addResource(new Path("core-site.xml"));
            FileSystem fs = FileSystem.get(conf);

            String filename = "/dir/00000027";
            long byteOffset = 3185041;

            rdr = new SequenceFile.Reader(fs, new Path(filename), conf);
            Text key = new Text();
            Text value = new Text();

            // Jump straight to the record we want. The offset must be the start of a
            // record; use sync() instead if you only have an approximate position.
            rdr.seek(byteOffset);
            rdr.next(key, value);

            // The value is plain text holding a JSON document; pull out the "body" field.
            JSONObject jso = new JSONObject(value.toString());
            String content = jso.getString("body");
            System.out.println("\n\n\n" + content + "\n\n\n");
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            if (rdr != null) {
                try {
                    rdr.close();
                } catch (IOException e) {
                    // nothing useful to do if close fails
                }
            }
        }
    }
}
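
One thing the answer leaves implicit is where ByteOffset comes from. A common approach (my assumption, not something spelled out above) is to record each record's start position during an earlier pass over the file and keep those positions as a small index, for example:

// Hypothetical indexing pass over the same SequenceFile (not block-compressed):
// remember where each record starts so a later reader can seek straight to it.
ArrayList<Long> offsets = new ArrayList<>();
long pos = rdr.getPosition();
while (rdr.next(key, value)) {
    offsets.add(pos);          // byte position at which the record just read began
    pos = rdr.getPosition();
}
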
vikeng21
  • I think what led me to think seek is too expensive is this comment in O'Reilly's "Hadoop: The Definitive Guide", Ch. 3.5: "Finally, bear in mind that calling seek() is a relatively expensive operation and should be used sparingly. You should structure your application access patterns to rely on streaming data (by using MapReduce, for example) rather than performing a large number of seeks." – dbg Jun 05 '14 at 18:48
  • Yes, I can understand, but as I have said, splitting the file into smaller chunks and processing them using seek is a good option. That is what we have done, and we process GBs of data using seek itself. What the book says is more of a suggestion, and it's up to you to decide which option you want to follow: either use a built-in API or write the same code in a different way. – vikeng21 Jun 05 '14 at 19:03
  • I've begun to look into FileSplit assuming that this is the splitting you are referring to? – dbg Jun 05 '14 at 20:53
  • Yup, that's what I am pointing at. Look into the file split option and use seek more effectively. I guess you're headed in the right direction now :) – vikeng21 Jun 06 '14 at 05:32
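
To make the split-plus-seek idea concrete for a plain-text file like the Pig output in the question: the same convention LineRecordReader uses at block boundaries works for a hand-rolled region read, i.e. skip the partial first line unless the region starts at byte 0, and let the last line run past the region's end so no record is cut in half. A rough sketch; the helper name and parameters are made up, and it assumes '\n'-terminated, mostly single-byte text:

// Hypothetical helper: return the complete lines whose start falls inside the byte
// region [start, start + length) of a text file in HDFS, without touching the rest.
public static List<String[]> readRegion(FileSystem fs, Path path, long start, long length)
        throws IOException {
    List<String[]> rows = new ArrayList<>();
    FSDataInputStream in = fs.open(path);
    in.seek(start);
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
    try {
        long pos = start;
        if (start != 0) {
            // A region rarely starts exactly on a line boundary; the partial first
            // line belongs to the previous region, so discard it here.
            String skipped = br.readLine();
            pos += (skipped == null ? 0 : skipped.length() + 1);
        }
        String line;
        while (pos < start + length && (line = br.readLine()) != null) {
            rows.add(line.split("\t"));
            pos += line.length() + 1;   // +1 for the '\n'; a rough count, fine for a sketch
        }
    } finally {
        br.close();
    }
    return rows;
}

Picking the regions is then just a matter of dividing fs.getFileStatus(path).getLen() into (start, length) chunks, so the API only ever streams the chunk the front end asked for.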