
I am reading a huge text file of words (one word per line), but I have to stop from time to time and resume reading the next day. Right now I'm using Apache Commons IO's LineIterator, but it's the wrong tool for this. My file is 7 GB and I had to interrupt reading at around 1 GB. To resume, I saved the number of lines already read, which means I need an if statement inside the while loop to skip them. Apache's FileUtils doesn't allow seeking, so that was my workaround.

What is the best/fastest solution? I thought of using RandomAccessFile to jump to the right place and continue reading, but I'm not sure whether I can land in the right place, or how to save the position I last read. I can re-read a couple of lines, so precision is not that important, but I haven't found a way to get the pointer. I have a BufferedReader to read the file and a RandomAccessFile to seek to the right place, but I don't know how to periodically save a position with the BufferedReader. Any hints?

Code (note the "SOMETHING" where I should print the value I can later pass as seekToByte):

try {

        RandomAccessFile rand = new RandomAccessFile(file,"r");
        rand.seek(seekToByte);
        startAtByte = rand.getFilePointer();
        rand.close();

    } catch(IOException e) {
        // do something
    }

    // Do it using the BufferedReader 
    BufferedReader reader = null;
    FileReader freader = null;
    try {
        freader = new FileReader(file);
        reader = new BufferedReader(freader);
        reader.skip(startAtByte);

        long i=0;
        for(String line; (line = reader.readLine()) != null; ) {

            lines.add(line);
            System.out.print(i+" ");
            if (lines.size()>1000) {
                commit(lines);
                System.out.println("");
                lines.clear();
                System.out.println(SOMETHING?);
            }
        }

    } catch(Exception e) {
        // handle this           
    } finally {
        if (reader != null) {
            try {reader.close();} catch(Exception ignore) {}
        }
    }
maugch
  • Ok, do you have the code so far ? – Caffeinated Nov 17 '15 at 18:56
  • Are you keeping track of the pointer anywhere? As in, do you keep a count (preferably a `BigInteger`) of the number of lines you've read? – Makoto Nov 17 '15 at 18:59
  • @Coffee https://bitsofinfo.wordpress.com/2009/04/15/how-to-read-a-specific-line-from-a-very-large-file-in-java/ Here is the code I read before writing this post. – maugch Nov 17 '15 at 19:00
  • Refer to http://stackoverflow.com/questions/4121678/java-read-last-n-lines-of-a-huge-file. Once you have decided how to store the count of lines already processed, the above may give some direction. – ram Nov 17 '15 at 19:00
  • @Makoto I used a Long to store the line count with FileUtils.lineIterator, but that's not useful. It has been skipping ahead for 20 minutes and my counter has only reached 42541917. – maugch Nov 17 '15 at 19:01

2 Answers


RandomAccessFile is indeed one way to go. When you stop reading, save where you are in the file with

long position = file.getFilePointer();

and later restore it with

file.seek(position);

to resume reading at the same place.

However, be careful when using RandomAccessFile: its readLine method does not fully support Unicode. It turns each byte into a char by zero-extending it, so multi-byte encodings such as UTF-8 are not decoded correctly.
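One way to combine the two ideas (buffered reading plus a position you can save) is to seek with RandomAccessFile, read through a buffered stream on the same file descriptor, and count the bytes consumed yourself. The following is only a sketch under some assumptions: the file is UTF-8 with `\n` or `\r\n` line endings, and the class name `ResumableLineReader`, the method `readLines`, and the sidecar checkpoint file are hypothetical names made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ResumableLineReader {

    /**
     * Reads up to maxLines lines starting at byte offset start and appends them to out.
     * Returns the byte offset of the first unread byte, which is safe to pass back as start.
     */
    static long readLines(Path file, long start, int maxLines, List<String> out) throws IOException {
        long pos = start;
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r");
             InputStream in = new BufferedInputStream(new FileInputStream(raf.getFD()))) {
            // Seek before the first read; the stream shares the descriptor's position.
            raf.seek(start);
            ByteArrayOutputStream line = new ByteArrayOutputStream();
            int count = 0;
            int b;
            while (count < maxLines && (b = in.read()) != -1) {
                pos++; // count every byte consumed, including the newline
                if (b == '\n') {
                    out.add(decode(line));
                    line.reset();
                    count++;
                } else {
                    line.write(b);
                }
            }
            if (line.size() > 0) {
                // Last line of the file had no trailing newline.
                out.add(decode(line));
            }
        }
        return pos;
    }

    /** Decodes one line as UTF-8 (an assumption about the file) and drops a trailing '\r'. */
    private static String decode(ByteArrayOutputStream buf) {
        String s = new String(buf.toByteArray(), StandardCharsets.UTF_8);
        return s.endsWith("\r") ? s.substring(0, s.length() - 1) : s;
    }

    public static void main(String[] args) throws IOException {
        Path words = Files.createTempFile("words", ".txt");
        Path checkpoint = Files.createTempFile("words", ".offset"); // hypothetical sidecar file
        Files.write(words, "alpha\nbeta\ngamma\ndelta\n".getBytes(StandardCharsets.UTF_8));

        // First session: read two lines, then persist the offset.
        List<String> batch = new ArrayList<>();
        long offset = readLines(words, 0L, 2, batch);
        Files.write(checkpoint, Long.toString(offset).getBytes(StandardCharsets.UTF_8));
        System.out.println(batch + " saved offset " + offset);

        // Next session: load the offset and continue from there.
        long saved = Long.parseLong(
                new String(Files.readAllBytes(checkpoint), StandardCharsets.UTF_8).trim());
        batch.clear();
        readLines(words, saved, 2, batch);
        System.out.println(batch);
    }
}
```

In the asker's loop, the value returned by `readLines` would be the "SOMETHING" to print after each commit. Because the offset is counted in bytes rather than chars, it stays valid for `seek` even when the words contain multi-byte characters.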

njzk2

Can you use predetermined offsets instead? For instance, chop the file into four pieces, (offset0, offset1), (offset1, offset2), etc., and use RecursiveAction (from the Fork/Join API) to process them in parallel.
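A rough sketch of that idea, with the caveat that raw byte offsets usually land in the middle of a line, so each boundary has to be pushed forward to the next line start. Here the per-chunk "work" is just counting lines; the class `ChunkedLineCount` and its methods are hypothetical names, and the real processing would replace `countLines`.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ChunkedLineCount {

    /** Returns parts+1 byte offsets; every interior offset is the first byte of a line (or EOF). */
    static long[] lineAlignedOffsets(Path file, int parts) throws IOException {
        long[] offsets = new long[parts + 1];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long size = raf.length();
            offsets[parts] = size;
            for (int i = 1; i < parts; i++) {
                long guess = Math.max(size / parts * i, offsets[i - 1]);
                raf.seek(guess);
                int b;
                while ((b = raf.read()) != -1 && b != '\n') {
                    // advance to the end of the line the guess landed in
                }
                offsets[i] = raf.getFilePointer(); // just past the newline, or EOF
            }
        }
        return offsets;
    }

    /** Counts the lines in the byte range [start, end); stands in for the real per-chunk work. */
    static long countLines(Path file, long start, long end) {
        long lines = 0;
        int last = '\n';
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r");
             InputStream in = new BufferedInputStream(new FileInputStream(raf.getFD()))) {
            raf.seek(start); // seek before the first buffered read
            for (long pos = start; pos < end; pos++) {
                int b = in.read();
                if (b == -1) break;
                if (b == '\n') lines++;
                last = b;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return last == '\n' ? lines : lines + 1; // count a final line without newline
    }

    /** Splits the chunk index range [lo, hi) in half until a single chunk is left. */
    static class CountTask extends RecursiveAction {
        final Path file; final long[] offsets; final long[] counts; final int lo; final int hi;

        CountTask(Path file, long[] offsets, long[] counts, int lo, int hi) {
            this.file = file; this.offsets = offsets; this.counts = counts; this.lo = lo; this.hi = hi;
        }

        @Override
        protected void compute() {
            if (hi - lo == 1) {
                counts[lo] = countLines(file, offsets[lo], offsets[lo + 1]);
            } else {
                int mid = (lo + hi) >>> 1;
                invokeAll(new CountTask(file, offsets, counts, lo, mid),
                          new CountTask(file, offsets, counts, mid, hi));
            }
        }
    }

    static long countInParallel(Path file, int parts) throws IOException {
        long[] offsets = lineAlignedOffsets(file, parts);
        long[] counts = new long[parts];
        ForkJoinPool.commonPool().invoke(new CountTask(file, offsets, counts, 0, parts));
        long total = 0;
        for (long c : counts) total += c;
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path words = Files.createTempFile("words", ".txt");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) sb.append("word").append(i).append('\n');
        Files.write(words, sb.toString().getBytes(StandardCharsets.UTF_8));
        System.out.println("lines: " + countInParallel(words, 4));
    }
}
```

The offsets can be computed once and stored, so an interrupted run only needs to remember which chunks are finished. Whether parallel reads actually help depends on the disk; on a single spinning drive they may not.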

user1529412
  • I'll definitely split the source file next time, but I would prefer not to do it now, since I'd lose the 4 hours of processing I've already done. – maugch Nov 17 '15 at 20:34
  • Ok, I haven't done much Java lately, but if I were doing this in C/C++, I would typically keep a placeholder for the line number where you stopped. Say you have read 5000000 (5 million) lines: you could even save that number as the first thing in your file, if possible, so that next time you just read the first bytes of the file to know where to pick up from. – user1529412 Nov 17 '15 at 21:36