1

I have created a method that reads specific lines from a file based on their line number. It works fine for most files but when I try to read a file that contains a large number of really long lines then it takes ages, particularly as it gets further down in the file. I've also done some debugging and it appears to take a lot of memory as well but I'm not sure if this is something that can be improved. I know there are some other questions which focus on how to read certain lines from a file but this question is focussed primarily on the performance aspect.

public static final synchronized List<String> readLines(final File file, final Integer start, final Integer end) throws IOException {
        BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
        List<String> lines = new ArrayList<>();
        try {
            String line = bufferedReader.readLine();
            Integer currentLine = 1;
            while (line != null) {
                if ((currentLine >= start) && (currentLine <= end)) {
                    lines.add(line + "\n");
                }
                currentLine++;
                if (currentLine > end) {
                    return lines;
                }
                line = bufferedReader.readLine();
            }
        } finally {
            bufferedReader.close();
        }
        return lines;
    }

How can I optimize this method to be faster than light?

Hooli
  • 1,135
  • 3
  • 19
  • 46
  • 1
    Try [this](http://stackoverflow.com/a/29637400/276052) but use something like `.limit(end - start)` and `.collect(Collectors.toList())`. – aioobe Oct 04 '16 at 16:29
  • Possibility => https://stackoverflow.com/a/9174095/2877364 – cxw Oct 04 '16 at 16:29
  • 1
    You might be able to seek to the `start` line, then only read `end - start` lines – OneCricketeer Oct 04 '16 at 16:29
  • @aioobe their approach is even slower than this. They read entire file. – talex Oct 04 '16 at 16:30
  • 2
    Possible duplicate of [How can I get a specific line from a text file?](http://stackoverflow.com/questions/29637370/how-can-i-get-a-specific-line-from-a-text-file) – cxw Oct 04 '16 at 16:30
  • @talex What about `if (currentLine > end) return lines`? – OneCricketeer Oct 04 '16 at 16:31
  • I see two solution for you: you buy faster hard drive or you index your files somehow. – talex Oct 04 '16 at 16:31
  • 1
    One small optimization you can make is increase the size of the `BufferedReader` so that it holds more data in memory. For example, use `new BufferedReader(file, 128*1024)` instead of the default. – Mike Laren Oct 04 '16 at 16:34
  • @aioobe: Could you possibly demonstrate by any chance? – Hooli Oct 04 '16 at 16:39
  • @talex, what makes you believe they read the entire file? – aioobe Oct 04 '16 at 17:26
  • @aioobe I think `Files.lines(Paths.get("yourFile.txt"))` read entire file. – talex Oct 04 '16 at 17:31
  • @talex, no, it doesn't. `Files.lines` produces a `Stream` and the lines aren't read from the file unless the terminal operation requests them. – aioobe Oct 04 '16 at 17:42
  • 1
    @Hooli, something like this (note: untested code) `List lines = Files.lines(Paths.get("yourfile.txt")).skip(start).limit(end - start).collect(Collectors.toList());`. But I doubt that it will be much faster than the solution you already have. – aioobe Oct 04 '16 at 17:44
  • You really need to run this through a profiler and see what BufferedReader is up to, eg I suspect that it spends time copying characters from input buffer into strings, most of which are then thrown away. I suspect you'd be better using something that does less processing, eg FileInputStream, and handle the character/line parsing yourself. – matt helliwell Oct 04 '16 at 17:58
  • @aioobe: I just tested it now and noticed a massive decrease in speed. I'm not sure what the guys at Oracle are smoking lately. – Hooli Oct 04 '16 at 18:00
  • @matthelliwell: Care to demonstrate? – Hooli Oct 04 '16 at 18:00
  • @Hooli Unfortunately I've got no time tonight, hence the comment rather than a proper answer. It's going to be something like `FileInputStream in = new FileInputStream("myfile.csv"); byte[] chars = new byte[MY_BUFFER_SIZE]; int count = in.read(chars);` to read characters directly into your buffer with no copying. Then followed by some careful parsing for new lines. You can probably copy most of it from BufferedReader but you'll be operating directly on your own buffer so can eliminate all array and string copying. – matt helliwell Oct 04 '16 at 18:13
  • @MikeLaren: Your buffer size suggestion helped a lot. It made it four times faster. – Hooli Oct 05 '16 at 10:15

1 Answers1

0

I realised that what I was doing before was inherently slow and used up too much memory.

By adding all lines to memory and then processing all lines in a List it was not only taking twice as long but was also creating String variables for no reason.

I am now using Java 8 Stream and processing at point of reading which is the fastest method I've used so far.

Path path = Paths.get(file.getAbsolutePath());
Stream<String> stream = Files.lines(path, StandardCharsets.UTF_8);
        for (String line : (Iterable<String>) stream::iterator) {
        //do stuff
        }   
}
Hooli
  • 1,135
  • 3
  • 19
  • 46
  • 1
    This is not inherently different from what you were doing in the original example. So something else was going possibly. Rather than guessing what the problem is, you should try profiling where it is actually spending its time using visualvm or even jtop. Your code looks fine, the only reason I can think of it would underperform is because it starts garbage collecting if your heap is too small. Also that synchronized is suspicious, do you have lots of methods calling this thing from multiple threads? – Jilles van Gurp Oct 11 '16 at 16:07
  • @JillesvanGurp: I thought the same regarding `synchronized` but after taking it out noticed no difference. There were no threads, I just thought it was good practice to have it on utility methods. It's a bit difficult to profile that kind of method because all the important bits happen behind the scenes, in the jvm code. – Hooli Oct 11 '16 at 16:27