
I want to read a large file as fast as possible. I am using a MappedByteBuffer like this:

String line = "";

try (RandomAccessFile file2 = new RandomAccessFile(new File(filename), "r"))
        {

            FileChannel fileChannel = file2.getChannel();


            MappedByteBuffer buffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size());


            for (int i = 0; i < buffer.limit(); i++)
            {
               char a = (char) buffer.get();
               if (a == '\n'){
                   System.out.println(line);  
                   line = "";
             }else{
                 line += Character.toString(c);


            }
        }

This is not working correctly. The lines it prints are garbled and do not match the actual content of the file. Is there a better way to read the lines of a file with a MappedByteBuffer?

Eventually I would like to split each line and extract certain fields (since it's a CSV), so this is just a minimal example that reproduces the problem.

bcsta
  • You're not decoding the bytes into characters. See [`Charset`](https://docs.oracle.com/en/java/javase/12/docs/api/java.base/java/nio/charset/Charset.html) and [`CharsetDecoder`](https://docs.oracle.com/en/java/javase/12/docs/api/java.base/java/nio/charset/CharsetDecoder.html). – Slaw May 31 '19 at 10:46
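A minimal sketch of what the comment suggests, assuming the file is UTF-8 (`filename` is a placeholder): decode the mapped bytes with a CharsetDecoder instead of casting each byte to a char.

import java.io.RandomAccessFile;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

try (RandomAccessFile raf = new RandomAccessFile(filename, "r");
     FileChannel channel = raf.getChannel()) {

    MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

    // Decode the raw bytes into characters; casting byte to char only works for ASCII.
    CharBuffer chars = StandardCharsets.UTF_8.newDecoder().decode(buffer);

    StringBuilder line = new StringBuilder();
    while (chars.hasRemaining()) {
        char c = chars.get();
        if (c == '\n') {
            System.out.println(line);
            line.setLength(0);
        } else if (c != '\r') { // tolerate Windows line endings
            line.append(c);
        }
    }
    if (line.length() > 0) {
        System.out.println(line); // last line without a trailing newline
    }
}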

1 Answer


I made some tests using a 21 GB file filled with random strings; each line had a length of 20-40 characters. It seems the built-in BufferedReader is still the fastest method.

File f = new File("sfs");
try (Stream<String> lines = Files.lines(f.toPath(), StandardCharsets.UTF_8)) {
    lines.forEach(System.out::println);
} catch (IOException e) {
    e.printStackTrace();
}

Reading the lines through a stream ensures they are read lazily, as you consume them, instead of loading the entire file at once.
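Since the eventual goal is to split each line of a CSV, the same lazy stream can feed the split directly. A rough sketch (the path, the column index, and the comma delimiter are placeholders, and String.split does not handle quoted fields):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

try (Stream<String> lines = Files.lines(Paths.get("data.csv"), StandardCharsets.UTF_8)) {
    lines.map(line -> line.split(",", -1))   // -1 keeps trailing empty fields
         .filter(fields -> fields.length > 2)
         .forEach(fields -> System.out.println(fields[2]));
} catch (IOException e) {
    e.printStackTrace();
}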

To improve speed even further you can increase the buffer size of the BufferedReader by a moderate factor. In my tests it started to outperform the default buffer size at about 10 million lines.

CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
int size = 8192 * 16; // 16x the default BufferedReader buffer size
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(Files.newInputStream(f.toPath()), decoder), size)) {
    br.lines().limit(LINES_TO_READ).forEach(s -> {
    });
} catch (IOException e) {
    e.printStackTrace();
}

The code I used for testing:

private static long LINES_TO_READ = 10_000_000;

private static void java8Stream(File f) {

    long startTime = System.nanoTime();

    try (Stream<String> lines = Files.lines(f.toPath(), StandardCharsets.UTF_8).limit(LINES_TO_READ)) {
        lines.forEach(line -> {
        });
    } catch (IOException e) {
        e.printStackTrace();
    }

    long endTime = System.nanoTime();
    System.out.println("no buffer took " + (endTime - startTime) + " nanoseconds");
}

private static void streamWithLargeBuffer(File f) {
    long startTime = System.nanoTime();

    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    int size = 8192 * 16;
    try (BufferedReader br = new BufferedReader(new InputStreamReader(Files.newInputStream(f.toPath()), decoder), size)) {
        br.lines().limit(LINES_TO_READ).forEach(s -> {
        });
    } catch (IOException e) {
        e.printStackTrace();
    }

    long endTime = System.nanoTime();
    System.out.println("using large buffer took " + (endTime - startTime) + " nanoseconds");
}

private static void memoryMappedFile(File f) {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();

    long linesReadCount = 0;
    String line = "";
    long startTime = System.nanoTime();

    try (RandomAccessFile file2 = new RandomAccessFile(f, "r")) {

        FileChannel fileChannel = file2.getChannel();
        // a single map() call can cover at most Integer.MAX_VALUE bytes
        MappedByteBuffer buffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0L, Integer.MAX_VALUE - 10_000_000);
        // decode the entire mapped region into a CharBuffer up front
        CharBuffer decodedBuffer = decoder.decode(buffer);

        for (int i = 0; i < decodedBuffer.limit(); i++) {
            char a = decodedBuffer.get();
            if (a == '\n') {
                line = "";
                // count a full line only when its newline terminator is reached
                if (++linesReadCount >= LINES_TO_READ) {
                    break;
                }
            } else {
                line += Character.toString(a);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

    long endTime = System.nanoTime();

    System.out.println("using memory mapped files took " + (endTime - startTime) + " nanoseconds");

}

Btw I noticed that FileChannel.map throws an IllegalArgumentException if the requested mapping is larger than Integer.MAX_VALUE bytes (roughly 2 GB), so a single mapping cannot cover a very large file, which makes the method impractical for reading such files in one go.
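If you still want memory mapping for files beyond 2 GB, the usual workaround is to map the file in windows of at most Integer.MAX_VALUE bytes and advance the offset after each window. A rough sketch (it glosses over the fact that a line or a multi-byte character can straddle a window boundary):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

long WINDOW = 1L << 30; // 1 GB per mapping; anything <= Integer.MAX_VALUE works

try (FileChannel channel = FileChannel.open(Paths.get("bigfile.csv"), StandardOpenOption.READ)) {
    long size = channel.size();
    for (long pos = 0; pos < size; pos += WINDOW) {
        long len = Math.min(WINDOW, size - pos);
        MappedByteBuffer window = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
        // process(window): lines and multi-byte characters split across
        // window boundaries still have to be stitched together
    }
} catch (IOException e) {
    e.printStackTrace();
}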

r33tnup