11

I am very new to Java, and I am trying to use Mathematica's Java interface to access a file using memory mapping (in the hope of a performance improvement).

The Mathematica code I have is (I believe) equivalent to the following Java code (based on this):

import java.io.FileInputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MainClass {
  private static final int LENGTH = 8*100;

  public static void main(String[] args) throws Exception {
    MappedByteBuffer buffer = new FileInputStream("test.bin").getChannel().map(FileChannel.MapMode.READ_ONLY, 0, LENGTH);
    buffer.load();
    buffer.isLoaded(); // returns false, why?
  }
}

I would like to use the array() method on the buffer, so I am trying to load the buffer's contents into memory first using load(). However, even after load(), isLoaded() returns false, and buffer.array() throws an exception: java.lang.UnsupportedOperationException at java.nio.ByteBuffer.array(ByteBuffer.java:940).

Why doesn't the buffer load and how can I call the array() method?

My ultimate aim here is to get an array of doubles using asDoubleBuffer().array(). The method getDouble() does work correctly, but I was hoping to get this done in one go for good performance. What am I doing wrong?
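
For reference, here is a minimal standalone sketch of the behaviour described above; it assumes a test.bin of at least 800 bytes in the working directory, and the class name is arbitrary:

import java.io.FileInputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedArrayCheck {
  public static void main(String[] args) throws Exception {
    // Map the first 800 bytes of the file read-only, as in the code above.
    MappedByteBuffer buffer = new FileInputStream("test.bin")
        .getChannel()
        .map(FileChannel.MapMode.READ_ONLY, 0, 8 * 100);

    buffer.load();                            // only a hint; the OS may ignore it
    System.out.println(buffer.isLoaded());    // may well print false
    System.out.println(buffer.hasArray());    // false: no backing byte[]
    System.out.println(buffer.getDouble(0));  // absolute reads do work

    try {
      buffer.array();
    } catch (UnsupportedOperationException e) {
      System.out.println("array() is unsupported for mapped buffers");
    }
  }
}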


As I am doing this from Mathematica, I'll post the actual Mathematica code I used too (equivalent to the above in Java):

Needs["JLink`"]
LoadJavaClass["java.nio.channels.FileChannel$MapMode"]
buffer = JavaNew["java.io.FileInputStream", "test.bin"]@getChannel[]@map[FileChannel$MapMode`READUONLY, 0, 8*100]

buffer@load[]
buffer@isLoaded[] (* returns False *)
Szabolcs
  • "A return value of false does not necessarily imply that the buffer's content is not resident in physical memory." `load` is only best efforts, and indeed may have loaded the data into physical memory only for it to be immediately swapped out. – Tom Hawtin - tackline Dec 21 '11 at 15:38
  • @TomHawtin-tackline I think I misunderstood the purpose of `load` then. What I'd like to achieve is to get the contents of the buffer as an array of doubles. The `array` method unfortunately doesn't work, and throws the exception I mentioned. I updated the question based on your feedback. – Szabolcs Dec 21 '11 at 15:40
  • 1
    `array` only works for buffers that are backed by an array (typically from `*Buffer.wrap`); see the sketch after this comment thread. – Tom Hawtin - tackline Dec 21 '11 at 15:46
  • @Szabolcs J/Link is using MathLink underneath for its operation. Therefore, importing a file into Mathematica through J/Link cannot be faster than using MathLink, which by itself can induce a pretty large overhead. If I understand correctly the source of your question, the main problem is not the load time for .mx files (I'd be hard pressed to see anything beating .mx load speed), but their coarse granularity. This should not matter much if every large .mx file only needs to be loaded once (in which case, this coarse granularity will be sufficient). If not, I'd create a file-system-type ... – Leonid Shifrin Dec 21 '11 at 19:01
  • @Szabolcs abstraction layer on top of much more fine-grained .mx files (which would serve as "clusters" for a large .mx file), so that we can get a good approximation to a random-access .mx file and avoid re-reading unnecessary data. I think that would be the key. For a large data file (.mx or otherwise), I'd first convert it to such a "composite" format (which may take a while but needs to be done only once). After that, one should be able to work with it similarly to a normal random-access file, and things should be fast then. – Leonid Shifrin Dec 21 '11 at 19:05
  • @Leonid That's true. But I still think the best solution is some very fast read operation (memory mapping the file?) together with Library Link. Library Link allows accessing the contents of a packed array *directly*, there's no data transfer/copy overhead whatsoever. – Szabolcs Dec 21 '11 at 22:48
  • @Leonid (Of course in C, not Java!) – Szabolcs Dec 21 '11 at 22:56
  • @Szabolcs Yes, I agree. I actually was toying with an idea to do something rather general for importing stuff based on this, for quite some time now. I have a few ideas of how to do it generally, but, as everything, it needs time (which I don't have currently :)) – Leonid Shifrin Dec 22 '11 at 14:33
  • @Leonid, I also don't have time and have stopped working on this. It does seem, based on a quick test, that simply using `fread()` to read a whole binary file is of comparable speed to `Get["file.mx"]`, and considerably faster than `BinaryReadList`, based on `AbsoluteTiming` (and not `Timing`!!) – Szabolcs Dec 22 '11 at 15:21
  • @Szabolcs Yes, I can believe that `fread` can be quite fast, but the main virtue of `Get` (or `Import`, as you found) with .mx files is that it reconstructs the Mathematica structures directly, by-passing the main evaluation loop, not unpacking packed arrays, etc. I don't know to what extent you can re-implement all that using LibraryLink, since from my cursory look at it some time ago I don't recall functions having direct low-level access to things like `DownValues` or other `...Values` - and anyway I suspect this would lead to re-inventing the wheel. I can see two clearly distinct ... – Leonid Shifrin Dec 22 '11 at 15:27
  • @Leonid My use case here was reading purely numerical data, from one large file, with good performance and without having to read the whole file. MX files can only be written by Mathematica, and can only be read in full. A binary array could easily be written from any language, and it would be possible to read only a part of it with this method (essentially a sped-up `BinaryReadList`, nothing more) ([motivation](http://stackoverflow.com/questions/8582795/faster-huge-data-import-than-getraggedmatrix-mx)) – Szabolcs Dec 22 '11 at 15:30
  • @Szabolcs projects here. One is to implement a fast generic import framework based on LibraryLink but targeted at medium-sized (perhaps hundreds of Megs) files and not based on .mx, and another one to work with huge files which can take advantage of .mx, where I would use .mx files as clusters in a "file-system" - type framework built on top of them, which may perhaps be built in the top-level Mathematica or Java (if the logic is in Java, it does not mean that the data transfer will go through J/Link. I've written a virtual file system in Java once and found Java a great vehicle for that). – Leonid Shifrin Dec 22 '11 at 15:31
  • @Szabolcs Yes, I realized where your question was coming from. For huge files, I'd still take the approach based on small .mx files as clusters - since this way we reuse all the work that was put into the .mx technology, and enjoy all the generality of .mx files. One would need to write a "file system" plus a converter which would convert a single large numerical file into a bunch of .mx files automatically. The performance of such a hybrid can be also made pretty good, I am sure. Whether or not it is easy to write a fast converter which can convert without loading full original numerical ... – Leonid Shifrin Dec 22 '11 at 15:38
  • @Szabolcs file into RAM, I don't know. – Leonid Shifrin Dec 22 '11 at 15:41
  • @Leonid I think what we really need for Mathematica is a way to process large data that doesn't fit into memory using the usual Mathematica paradigms (Map, Fold, Inner, Outer, [sparse data structures](http://stackoverflow.com/questions/8596134/efficient-alternative-to-outer-on-sparse-arrays-in-mathematica), or any other vectorized method). Part of this is a very fast "file system" of the kind you mention. I remember you have given answers in this spirit here; I should read them again when I have time. I have always felt that M- is convenient only when we can keep everything in memory, and... – Szabolcs Dec 22 '11 at 17:11
  • ... it is not so great for sequential processing (such as reading from an input file, processing it, and writing to an output file). I really wish M- introduced a framework for large data that needs to be kept on disk but can be processed almost as conveniently as in-memory expressions. Maybe we could have a special List-like type that swaps out too-big parts to disk when necessary, but supports all list operations? (Similarly to how `SparseArray` is a separate type yet supports all that `List` supports.) I haven't thought this through, though. I wish I had more time. – Szabolcs Dec 22 '11 at 17:13
  • @Szabolcs I feel your pain (both regarding the lack of time and the lack of the system for processing large files). I think this would be a terrific project. As for `Map` etc, completely agree - this is what I meant when I mentioned lazy structures built on top of this "file system" in my original comments to Rolf's question. Roman Maeder showed how to implement lazy lists in one of his "Mathematica programmer" books. Perhaps one could learn from that as well. But designing a system that would be both general and fast is a tough problem. Wish I had more time ...:) – Leonid Shifrin Dec 22 '11 at 17:31
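
To illustrate the point made in the comments above about array-backed buffers, here is a small sketch (not part of the original discussion): buffers created with wrap() or allocate() expose their backing byte[], while direct and memory-mapped buffers do not.

import java.nio.ByteBuffer;

public class BackingArrayDemo {
  public static void main(String[] args) {
    // Heap buffers from wrap() or allocate() are backed by a byte[].
    ByteBuffer heap = ByteBuffer.wrap(new byte[800]);
    System.out.println(heap.hasArray());     // true
    System.out.println(heap.array().length); // 800, no copy involved

    // Direct buffers (like MappedByteBuffer) have no accessible byte[];
    // calling array() on them throws UnsupportedOperationException.
    ByteBuffer direct = ByteBuffer.allocateDirect(800);
    System.out.println(direct.hasArray());   // false
  }
}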

2 Answers

5

According to the Javadoc:
"The content of a mapped byte buffer can change at any time, for example if the content of the corresponding region of the mapped file is changed by this program or another. Whether or not such changes occur, and when they occur, is operating-system dependent and therefore unspecified.

All or part of a mapped byte buffer may become inaccessible at any time, for example if the mapped file is truncated. An attempt to access an inaccessible region of a mapped byte buffer will not change the buffer's content and will cause an unspecified exception to be thrown either at the time of the access or at some later time. It is therefore strongly recommended that appropriate precautions be taken to avoid the manipulation of a mapped file by this program, or by a concurrently running program, except to read or write the file's content."

To me that seems like too many conditions and too much unpredictable behavior. Do you particularly need this class?

If you just need to read the file contents in the fastest way, give this a try:

File f = new File("test.bin");   // any file; "test.bin" is the file from the question
FileChannel fChannel = new FileInputStream(f).getChannel();
byte[] barray = new byte[(int) f.length()];
ByteBuffer bb = ByteBuffer.wrap(barray);
bb.order(ByteOrder.LITTLE_ENDIAN);
fChannel.read(bb);

It reads at a speed close to what a raw disk benchmark gives.

For doubles you can use a DoubleBuffer (with a double[] of f.length()/8 elements) or just call the getDouble(int) method of ByteBuffer.
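
If the goal is a plain double[] (as in the question), here is a sketch of one way to finish the job, continuing the fragment above and assuming the whole file consists of little-endian doubles:

// barray already holds the file's bytes after fChannel.read(bb)
bb.rewind();                                    // reset the position before taking the view
java.nio.DoubleBuffer db = bb.asDoubleBuffer(); // inherits the little-endian order set above
double[] doubles = new double[db.remaining()];  // f.length()/8 elements
db.get(doubles);                                // copies everything into the array in one call

Note that, as the comments below mention, this bulk copy can still be slow for very large files.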

andrey
  • I don't need this class in particular. I was hoping to be able to get the contents of part of a very large binary file (not all of it) as an array of doubles, in a faster way than what the builtin Mathematica functionality provides. – Szabolcs Dec 21 '11 at 15:46
  • Updated the answer with "For doubles you can use a DoubleBuffer (with a double[] of f.length()/8 elements) or just call the getDouble(int) method of ByteBuffer." You said you will read only some doubles. I'd use ByteBuffer to avoid possible conversion of unneeded bytes to doubles (in the case of DoubleBuffer). – andrey Dec 21 '11 at 15:51
  • Thanks @andrey. I do need to get a `double []` eventually, as this is what gets auto-converted back to a Mathematica object (i.e. I can't just use `getDouble` with some index). If I try to read into a `DoubleBuffer` directly, it doesn't work. If I read into a `ByteBuffer`, it does work, and I can get it `asDoubleBuffer()`, but what I obtain this way returns `false` for `hasArray()` again, i.e. I can't convert it to a plain array of `double`s. Do you have any suggestions on how to get an array of double without explicitly looping through the whole thing and copying them out one by one? – Szabolcs Dec 21 '11 at 16:06
  • 3
    Alright, I managed to get it working with `.asDoubleBuffer().get(anArray)`, but it turns out to be very slow for large files, so I'm giving up on trying to use Java for this (since I have next to zero Java knowledge and this already took too long). Thanks for the help! – Szabolcs Dec 21 '11 at 16:29
  • Sorry for the late response. I just checked that the default byte ordering in ByteBuffer is BIG_ENDIAN, while for reading these doubles we need LITTLE_ENDIAN. Perhaps it's too late, but anyway, I've fixed the code with "bb.order(ByteOrder.LITTLE_ENDIAN);" (a short demonstration follows these comments). More about converting: http://stackoverflow.com/questions/6471177/converting-8-bytes-of-little-endian-binary-into-a-double-precision-float – andrey Dec 22 '11 at 07:28
  • Thanks @andrey, indeed I was not getting the same result back that I wrote out with another program, but I thought: no big deal, it's always possible to also *write* the same thing using Java and the endianness will match. But unfortunately it turns out this method is bound to be slow due to the data transfer between Java and Mathematica (even if the Java read on its own weren't slow..) – Szabolcs Dec 22 '11 at 08:01
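
A quick, self-contained way to see the byte-order issue mentioned in the comments above (a sketch, not from the original exchange):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
  public static void main(String[] args) {
    byte[] bytes = new byte[8];
    // Store 1.5 as a little-endian double.
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).putDouble(1.5);

    // Reading the same bytes with the default big-endian order gives garbage,
    // while the little-endian read recovers 1.5.
    double big    = ByteBuffer.wrap(bytes).getDouble();
    double little = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getDouble();
    System.out.println(big + " vs " + little);
  }
}
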
0

In the JDK's ByteBuffer source:

final byte[] hb;                  // Non-null only for heap buffers

so the backing array is simply not there for MappedByteBuffer, but it is for HeapByteBuffer.

In the Android sources:

/**
 * Child class implements this method to realize {@code array()}.
 *
 * @see #array()
 */
abstract byte[] protectedArray();

and again it is not implemented in MappedByteBuffer, but ByteArrayBuffer, for example, does return its backing array:

@Override byte[] protectedArray() {
  if (isReadOnly) {
    throw new ReadOnlyBufferException();
  }
  return backingArray;
}

The point of a memory map is to be off-heap; a backing array would be on the heap.
If you open the FileChannel from a RandomAccessFile and then call map on the channel, you can use the bulk get() method on the MappedByteBuffer to read into a byte[]. This copies from the off-heap mapping back into the heap without going through file I/O again.

MappedByteBuffer buffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size());
byte[] b = new byte[buffer.limit()];
buffer.get(b);
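
Putting this together, here is a minimal sketch of the whole flow; the file name, the try-with-resources handling, and the final double[] step are assumptions added for illustration rather than part of the original answer:

import java.io.RandomAccessFile;
import java.nio.DoubleBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedBulkRead {
  public static void main(String[] args) throws Exception {
    try (RandomAccessFile raf = new RandomAccessFile("test.bin", "r");
         FileChannel fileChannel = raf.getChannel()) {

      // Assumes the file fits into a single mapping (under 2 GB).
      MappedByteBuffer buffer =
          fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size());

      // Bulk copy from the off-heap mapping into an on-heap byte[].
      byte[] b = new byte[buffer.limit()];
      buffer.get(b);

      // Or copy straight into a double[] via a DoubleBuffer view
      // (call buffer.order(...) first if the file is not big-endian).
      buffer.rewind();
      DoubleBuffer db = buffer.asDoubleBuffer();
      double[] doubles = new double[db.remaining()];
      db.get(doubles);
    }
  }
}
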
Droid Teahouse