
I am really in trouble: I want to read HUGE files (several GB) using FileChannels and MappedByteBuffers - all the documentation I found implies it's rather simple to map a file using the FileChannel.map() method. Of course there is a limit at 2 GB, as all the Buffer methods use int for position, limit and capacity - but what about the system-imposed limits below that?

In reality, I get lots of OutOfMemoryErrors! And no documentation at all that really defines the limits! So - how can I safely map a file that fits within the int limit into one or several MappedByteBuffers without just getting exceptions?

Can I ask the system which portion of a file I can safely map before I try FileChannel.map()? How? And why is there so little documentation about this feature?

Zordid

4 Answers


I can offer some working code. Whether this solves your problem or not is difficult to say. It hunts through a file for a pattern recognised by a Hunter (a sketch of that interface follows the code).

See the excellent article Java tip: How to read files quickly for the original research (not mine).

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// 4k buffer size.
static final int SIZE = 4 * 1024;
static byte[] buffer = new byte[SIZE];

// Fastest because a FileInputStream has an associated channel.
private static void scanDataFile(Hunter p, FileInputStream f) throws FileNotFoundException, IOException {
  // Use a mapped and buffered stream for best speed.
  // See: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly
  FileChannel ch = f.getChannel();
  long red = 0L; // bytes consumed so far
  do {
    // Map at most Integer.MAX_VALUE bytes at a time.
    long read = Math.min(Integer.MAX_VALUE, ch.size() - red);
    MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, red, read);
    int nGet;
    while (mb.hasRemaining() && p.ok()) {
      // Copy the next chunk of the mapping into the scratch buffer.
      nGet = Math.min(mb.remaining(), SIZE);
      mb.get(buffer, 0, nGet);
      for (int i = 0; i < nGet && p.ok(); i++) {
        p.check(buffer[i]);
      }
    }
    red += read;
  } while (red < ch.size() && p.ok());
  // Finish off.
  p.close();
  ch.close();
  f.close();
}
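
The Hunter type isn't shown in the answer; a minimal contract matching the calls the scanner makes (my assumption, not part of the original code) could look like this:

// Hypothetical Hunter contract, inferred from the calls above.
interface Hunter {
  boolean ok();        // false once the pattern has been found or the hunt should stop
  void check(byte b);  // feed the next byte into the matcher
  void close();        // finish up and report results / release resources
}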
OldCurmudgeon

What I use is a List<ByteBuffer> where each ByteBuffer maps the file in blocks of 16 MB to 1 GB. I use powers of 2 to simplify the logic. I have used this to map files of up to 8 TB.
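
Here is a minimal sketch of that approach (my reconstruction, not Peter Lawrey's actual code), assuming read-only access and an illustrative 256 MB (2^28 byte) chunk size; the method names are made up:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

// Illustrative chunk size: 256 MB = 2^28 bytes (a power of 2, as suggested above).
static final int CHUNK_BITS = 28;
static final long CHUNK = 1L << CHUNK_BITS;

static List<MappedByteBuffer> mapFile(String path) throws Exception {
  List<MappedByteBuffer> chunks = new ArrayList<>();
  try (RandomAccessFile raf = new RandomAccessFile(path, "r");
       FileChannel ch = raf.getChannel()) {
    long size = ch.size();
    // One mapping per chunk; each is comfortably below Integer.MAX_VALUE.
    for (long offset = 0; offset < size; offset += CHUNK) {
      long length = Math.min(CHUNK, size - offset);
      chunks.add(ch.map(FileChannel.MapMode.READ_ONLY, offset, length));
    }
  }
  // The mappings stay valid after the channel is closed.
  return chunks;
}

// Powers of 2 make addressing an absolute position a shift and a mask.
static byte byteAt(List<MappedByteBuffer> chunks, long pos) {
  return chunks.get((int) (pos >>> CHUNK_BITS)).get((int) (pos & (CHUNK - 1)));
}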

A key limitation of memory-mapped files is that you are limited by your virtual memory. If you have a 32-bit JVM you won't be able to map very much.

I wouldn't keep creating new memory mappings for a file, because these are never cleaned up. You can create lots of them, but there appears to be a limit of about 32K mappings on some systems (no matter how small they are).

The main reason I find memory-mapped files useful is that they don't need to be flushed (if you can assume the OS won't die). This allows you to write data in a low-latency way, without worrying about losing too much data if the application dies, or losing too much performance by having to write() or flush().
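
For example (a sketch with a hypothetical journal file): a write to the mapped region lands in the OS page cache immediately, so it survives a JVM crash without any explicit write() or flush() call.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical low-latency journal slot: the write is visible to the OS
// immediately, so it survives a JVM crash with no flush on the write path.
static void recordTimestamp() throws Exception {
  try (FileChannel ch = new RandomAccessFile("journal.dat", "rw").getChannel()) {
    MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
    mb.putLong(0, System.nanoTime());
    // mb.force() is only needed to survive an OS crash or power loss.
  }
}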

Peter Lawrey
  • I have [created a demo for your idea](http://stackoverflow.com/a/34109746/4563974). Thanks, your advice is really useful. I just don't understand the part about closing the buffers. You may not need to close files if Windows has them cached and can supply the data from cache rather than disk for another user. But the problem is that another user cannot open the file unless you have closed it. I get a lot of `cannot access the file` errors when I re-run my program many times from the Scala build tool. – Valentin Tihomirov Dec 05 '15 at 20:54
  • 1
    @ValentinTihomirov You need to clean up buffers for memory mapped files or the file will remain locked on Windows. – Peter Lawrey Dec 05 '15 at 21:19
  • 2
    `buf = null ; System.gc` you mean? – Valentin Tihomirov Dec 05 '15 at 22:27
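
For reference, the explicit clean-up being discussed usually relies on JDK internals rather than System.gc(); this is a sketch for Java 8 and earlier (unsupported API - on Java 9+, sun.misc.Unsafe.invokeCleaner(buffer) plays the same role):

import java.lang.reflect.Method;
import java.nio.MappedByteBuffer;

// Pre-Java 9 hack: invoke the buffer's internal cleaner via reflection so the
// mapping (and the file lock on Windows) is released deterministically.
static void unmap(MappedByteBuffer buffer) throws Exception {
  Method cleanerMethod = buffer.getClass().getMethod("cleaner");
  cleanerMethod.setAccessible(true);
  Object cleaner = cleanerMethod.invoke(buffer);
  Method cleanMethod = cleaner.getClass().getMethod("clean");
  cleanMethod.setAccessible(true);
  cleanMethod.invoke(cleaner);
}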

You don't use the FileChannel API to write the entire file at once. Instead, you send the file in parts. See the example code in Martin Thompson's post comparing the performance of Java IO techniques: Java Sequential IO Performance
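
One common way to send a file in parts with plain channel calls is a transferTo loop; this is a sketch with an arbitrary 64 MB chunk size, not the code from Martin Thompson's post:

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

// Copy a huge file in bounded chunks; transferTo may move fewer bytes
// than requested, so loop until the whole file has been sent.
static void sendFile(FileChannel src, WritableByteChannel dest) throws IOException {
  long chunk = 64L * 1024 * 1024; // arbitrary chunk size
  long position = 0, size = src.size();
  while (position < size) {
    position += src.transferTo(position, Math.min(chunk, size - position), dest);
  }
}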

In addition, there is not much documentation because you are making a platform-dependent call. From the map() JavaDoc:

Many of the details of memory-mapped files are inherently dependent upon the underlying operating system and are therefore unspecified.

noahlz
  • Martin Thompson is just great. Note that FileChannel is actually *not* significantly faster for the 8 GB file(!) - but as Martin says, "Your Mileage May Vary." – noahlz Sep 21 '12 at 14:13
  • 1
    Mapped IO is much faster when you access individual words rather than streams. I have seen that running [that code](http://stackoverflow.com/questions/34097130/34109746#34109746). It was 50 MB/sec for mem-mapped io and 300 times slower for using raf directly. – Valentin Tihomirov Dec 05 '15 at 21:15
  • However, RAF is slightly faster for random access. If the file is only 1/10 of your system memory, then, for millions of requests tested, MM approaches 14 ns/request whereas raf is 1000x slower, at 11 us/req. However, a memory-mapped search takes 5 ms if the file is larger than main memory, whereas RAF approaches 4.8 ms for more than 20k random accesses. The difference is because raf always accesses the file, no matter what, whereas MMF provides stronger caching. Both have the same access time, but raf reads only the one needed word whereas MMF reads a whole 4k page, thus becoming slightly slower on long files. – Valentin Tihomirov Dec 06 '15 at 08:36
  • 1
    Sorry, MMF reads random longs within 1GB file at speed 140 ns, which is only 100 times faster than RAF. The funny thing is that RAF reads single byte 8 times when you request readLong from it. – Valentin Tihomirov Dec 06 '15 at 11:49

The bigger the file, the less you want it all in memory at once. Devise a way to process the file a buffer at a time, a line at a time, etc.

MappedByteBuffers are especially problematic, as there is no defined release of the mapped memory, so using more than one at a time is essentially bound to fail.
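
A minimal sketch of the buffer-at-a-time approach (assuming a 64 KB scratch buffer and byte-wise processing):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Reuse one bounded buffer so memory use stays constant regardless of file size.
static void process(Path file) throws IOException {
  ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
  try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
    while (ch.read(buf) != -1) {
      buf.flip();
      while (buf.hasRemaining()) {
        byte b = buf.get(); // handle each byte (or slice out lines) here
      }
      buf.clear();
    }
  }
}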

user207421
  • I totally went in the wrong direction wanting to map the whole file; you are right. I misunderstood the concept of mapping a file into memory - I thought everything was just virtual and the OS would load pages into memory whenever needed... but for gigabyte or terabyte files that simply does not work at all. – Zordid Dec 06 '12 at 16:30
  • 1
    @Zordid It does load pages as needed, but it maps the memory all at once, and that requires swap space, allocation of memory addresses, ... all precious resources with 'no defined release' time. – user207421 Dec 06 '12 at 18:49