
Stated generally: how do you implement a method `byte[] get(offset, length)` for a memory-mapped file that is bigger than 2 GB in Java?

With context:

I'm trying to efficiently read files that are bigger than 2 GB using random I/O. The obvious idea is to use Java NIO and its memory-mapped file API.

The problem is the 2 GB limit on a single memory mapping. One solution would be to map multiple pages of up to 2 GB each and use the offset to index into them.
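To make the indexing idea concrete, here is a minimal sketch of the offset arithmetic. The class name `PageIndex`, its method names, and the 1 GiB page size are assumptions chosen for illustration; any page size up to `Integer.MAX_VALUE` should work the same way.

```java
/** Hypothetical helper that splits an absolute file offset into (page, position). */
final class PageIndex {
    // Assumed page size: 1 GiB, safely below the 2 GB limit of one mapping.
    static final long PAGE_SIZE = 1L << 30;

    /** Index of the mapped page that contains the absolute file offset. */
    static int pageOf(long offset) {
        return (int) (offset / PAGE_SIZE);
    }

    /** Position of the absolute file offset inside its page. */
    static int positionInPage(long offset) {
        return (int) (offset % PAGE_SIZE);
    }
}
```

With these assumptions, an offset of 2.5 GiB lands in page 2 at position 0.5 GiB.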

There's a similar solution here:

Binary search in a sorted (memory-mapped ?) file in Java

The problem with this solution is that it's designed to read a single byte, while my API is supposed to read a `byte[]` (so my API would be something like `read(offset, length)`).

Would it just work to change that final `get()` to a `get(offset, length)`? What happens when the `byte[]` I'm reading straddles two pages?

marcorossi

1 Answer


No. Simply changing `get()` to `get(offset, length)` in my answer to "Binary search in a sorted (memory-mapped ?) file in Java" would not work, because a read can cross the boundary between two memory-mapped buffers, as you suspect. I can see two possible solutions:

  1. Overlap the memory-mapped regions. When you do a read, pick the region whose start byte is immediately before the read's start byte. This approach won't work for reads larger than 50% of the maximum memory map size.
  2. Write a method that builds the byte array by reading from two different memory-mapped regions. I'm not keen on this approach, as I think some of the performance gains will be lost because the resulting array will not be memory mapped.
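As an illustration of the second option, the sketch below copies a requested range out of however many mapped pages it touches. It is only one possible shape, not the code from the linked answer: the class name `PagedMappedFile`, its constructor, and the caller-supplied page size are all assumptions made for this example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: maps a file as fixed-size pages and copies ranges out of them. */
final class PagedMappedFile {
    private final MappedByteBuffer[] pages;
    private final long pageSize;
    private final long fileSize;

    PagedMappedFile(Path path, int pageSize) throws IOException {
        this.pageSize = pageSize;
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            fileSize = ch.size();
            int count = (int) ((fileSize + pageSize - 1) / pageSize);
            pages = new MappedByteBuffer[count];
            for (int i = 0; i < count; i++) {
                long start = (long) i * pageSize;
                long len = Math.min(pageSize, fileSize - start);
                // A mapping remains valid after its channel is closed.
                pages[i] = ch.map(FileChannel.MapMode.READ_ONLY, start, len);
            }
        }
    }

    /** Copies length bytes starting at offset, crossing page boundaries as needed. */
    byte[] get(long offset, int length) {
        if (offset < 0 || length < 0 || offset > fileSize || length > fileSize - offset) {
            throw new IndexOutOfBoundsException("offset=" + offset + ", length=" + length);
        }
        byte[] out = new byte[length];
        int copied = 0;
        while (copied < length) {
            long pos = offset + copied;
            int page = (int) (pos / pageSize);
            int inPage = (int) (pos % pageSize);
            // duplicate() has its own position, so the shared page is not mutated.
            ByteBuffer view = pages[page].duplicate();
            view.position(inPage);
            int n = Math.min(length - copied, view.remaining());
            view.get(out, copied, n);
            copied += n;
        }
        return out;
    }
}
```

Because the loop copies at most the remainder of the current page on each pass, a read that stays inside one page performs a single bulk copy, and a read that straddles a boundary simply performs one more.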
Stu Thompson
  • What performance gains will be lost? If you're returning a `byte[]`, you're copying from the `mmap()`ed region anyway. Calling `System.arraycopy` twice instead of once on the same total number of bytes isn't that much worse. – Scott Lamb Sep 13 '11 at 20:14
  • @Scott Lamb: I agree that the performance hit would be negligible for those probably rare edge conditions when `get()` needs to read from two different maps in the "binary search" algo. My answer is saying you'll need to code around that, hence the two options. Just adding the offset with no new code behind `get()` will result in hard errors like index out of bounds errors. – Stu Thompson Sep 14 '11 at 06:27
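The hard error mentioned in the last comment can be reproduced in a few lines. In this sketch a small heap `ByteBuffer` stands in for one mapped page (an assumption to keep the example tiny; `MappedByteBuffer` inherits the same bulk `get`), and the names `BoundaryDemo` and `overruns` are hypothetical. A relative bulk `get(byte[])` that asks for more bytes than remain throws `BufferUnderflowException`, while an absolute `get(int)` past the limit throws `IndexOutOfBoundsException`.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

/** Hypothetical demo of a bulk read that runs past the end of one page. */
final class BoundaryDemo {
    /** Returns true if reading length bytes at position overruns the page. */
    static boolean overruns(ByteBuffer page, int position, int length) {
        ByteBuffer view = page.duplicate(); // leave the caller's position alone
        view.position(position);
        try {
            view.get(new byte[length]);
            return false;
        } catch (BufferUnderflowException e) {
            // Fewer than length bytes remain between position and limit.
            return true;
        }
    }
}
```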