
I've got a moderately big set of data, about 800 MB or so, that is basically a big precomputed table I need in order to speed up some computation by several orders of magnitude (creating that file took days on several multicore computers, using an optimized and multi-threaded algorithm... I really do need that file).

Now that it has been computed once, that 800MB of data is read only.

I cannot hold it in memory.

As of now it is one big huge 800 MB file, but splitting it into smaller files isn't a problem if that can help.

I need to read about 32 bits of data here and there in that file, many times over. I don't know beforehand where I'll need to read the data: the reads are uniformly distributed.

What would be the fastest way in Java to do my random reads in such a file or files? Ideally I should be doing these reads from several unrelated threads (but I could queue the reads in a single thread if needed).

Is Java NIO the way to go?

I'm not familiar with 'memory-mapped files': I think I don't want to map the 800 MB in memory.

All I want is the fastest random reads I can get to access these 800MB of disk-based data.

By the way, in case people wonder, this is not at all the same as the question I asked not long ago:

Java: fast disk-based hash set

cocotwo
  • Is there no way to throw that data into a database, which is exactly optimized to do that sort of stuff? – Yuval Adam Feb 27 '10 at 09:49
  • I am assuming it is already sorted and you are doing binary [or interpolation] search on it, right? Also, if possible you could shove it into a DB, which is optimised for querying huge data sets, perf. will be much better. – Fakrudeen Feb 27 '10 at 09:51
  • Why can you not put it all into memory? Buying more memory is likely to be *much* cheaper than writing code to improve the situation - and it has the benefit of giving you more memory for other things too... 800MB really isn't a lot of memory these days. – Jon Skeet Feb 27 '10 at 09:58
  • You want to speed up your random access by an order of magnitude. Get more RAM, as Jon says, or if not possible use a solid state drive. – JRL Feb 27 '10 at 13:11
  • @Jon Skeet and JRL: sadly this is for something that is deployed on a lot of machines... – cocotwo Mar 02 '10 at 13:02
  • @cocotwo: And are those machines all massively short of memory? Would they not benefit in general? Seriously, getting more memory is likely to give you the best bang for the buck in general. – Jon Skeet Mar 02 '10 at 13:36

4 Answers


800MB is not that much to load up and store in memory. If you can afford to have multicore machines ripping away at a data set for days on end, you can afford an extra GB or two of RAM, no?

That said, read up on Java's java.nio.MappedByteBuffer. Your comment "I think I don't want to map the 800 MB in memory" suggests the concept isn't clear yet.

In a nutshell, a mapped byte buffer lets you access the data programmatically as if it were in memory, while it may actually be on disk or in memory; that is for the OS to decide, as Java's MBB is backed by the OS's virtual memory subsystem. It is also nice and fast. You can safely read a single MBB from multiple threads, provided you stick to the absolute get methods (the relative ones mutate the buffer's position and would need external synchronization).

Here are the steps I recommend you take:

  1. Instantiate a MappedByteBuffer that maps your data file to the MBB. Creating the mapping is kinda expensive, so keep it around.
  2. In your lookup method (see the sketch below)...
    1. compute the absolute byte offset of the 32-bit value you need
    2. call getInt(int index) at that offset (or set the buffer's position and bulk-read into a byte[4] with get(byte[] dst, int offset, int length), then decode the bytes yourself)
    3. the returned int is your data, ready to use

And presto! You have your data!
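Here is a minimal sketch of those steps, assuming the table is a flat array of 4-byte big-endian values (the class name, file handling, and record layout are my own placeholders, not part of any standard API):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class TableLookup {
        private final MappedByteBuffer buffer;

        public TableLookup(String path) throws IOException {
            RandomAccessFile raf = new RandomAccessFile(path, "r");
            try {
                // Map the whole file read-only. 800 MB fits in one mapping
                // (a single MappedByteBuffer is limited to 2 GB) and is paged
                // in by the OS on demand, not copied into the Java heap.
                buffer = raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, raf.length());
            } finally {
                raf.close(); // the mapping remains valid after the file is closed
            }
        }

        // Absolute getInt(int) never touches the buffer's position, so
        // concurrent reads from several threads are fine.
        public int lookup(int byteOffset) {
            return buffer.getInt(byteOffset);
        }
    }

If you would rather give each thread its own view, buffer.duplicate() creates an independent buffer (own position, shared content) over the same mapping.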

I'm a big fan of MBBs and have used them successfully for such tasks in the past.

Stu Thompson

RandomAccessFile (blocking) may help: http://java.sun.com/javase/6/docs/api/java/io/RandomAccessFile.html

You can also use FileChannel.map() to map a region of the file to memory, then read from the resulting MappedByteBuffer.

See also: http://java.sun.com/docs/books/tutorial/essential/io/rafs.html
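A minimal sketch of the RandomAccessFile route, with a placeholder file name; note that seek() plus readInt() mutate the shared file pointer, so each thread needs its own instance (or a synchronized wrapper, as mentioned in the comments):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class RafLookup {
        public static void main(String[] args) throws IOException {
            RandomAccessFile raf = new RandomAccessFile("table.bin", "r"); // placeholder name
            try {
                raf.seek(123456L * 4);     // jump to the record's byte offset
                int value = raf.readInt(); // read 4 bytes, big-endian
                System.out.println(value);
            } finally {
                raf.close();
            }
        }
    }

FileChannel's positional read(ByteBuffer dst, long position) sidesteps the shared-pointer problem: it leaves the channel's position untouched and is safe to call from multiple threads.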

Konrad Garus
  • @Konrad Garus: ok but that doesn't really help me much :( What I'd like to know is what is the fastest way to random reads in a 800MB read-only file (possibly from multiple threads). – cocotwo Feb 27 '10 at 09:41
  • Offhand I think that nio (last link) and RandomAccessFile have similar performance, but use different APIs. nio API is a bit more complex, but it can be non-blocking. Both would require a synchronized wrapper for thread safety. – Konrad Garus Feb 27 '10 at 10:10

Actually 800 MB isn't very big. If you have 2 GB of memory or more, it can reside in disk cache if not in your application itself.

Peter Lawrey

For the write case, on Java 7, AsynchronousFileChannel is worth a look.
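For illustration, a stripped-down sketch of a single asynchronous write in that spirit; the file name, offset, and record content are made up, and a real benchmark would keep many writes in flight instead of waiting on each Future:

    import java.nio.ByteBuffer;
    import java.nio.channels.AsynchronousFileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.concurrent.Future;

    public class AsyncWriteSketch {
        public static void main(String[] args) throws Exception {
            AsynchronousFileChannel ch = AsynchronousFileChannel.open(
                    Paths.get("big.dat"), // placeholder name
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            try {
                ByteBuffer record = ByteBuffer.allocate(160); // 160-byte record, as in the benchmark
                record.put(new byte[160]);                    // dummy content
                record.flip();
                long offset = 12345L * 160;                   // some random record offset
                Future<Integer> pending = ch.write(record, offset); // returns without blocking
                pending.get();                                // wait for this one write to finish
            } finally {
                ch.close();
            }
        }
    }

The write(ByteBuffer, long, A, CompletionHandler) overload supports the fire-many-at-once style that presumably lets the overlapped IO shine.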

When performing random record-oriented writes across large files (exceeding physical memory, so caching isn't helping everything) on NTFS, I find that AsynchronousFileChannel performs over twice as many operations in single-threaded mode as a normal FileChannel (on a 10 GB file, with 160-byte records, completely random writes, some random content, and several hundred iterations of the benchmarking loop to reach steady state: roughly 5,300 writes per second).

My best guess is that because the asynchronous IO boils down to overlapped IO on Windows 7, the NTFS file system driver can update its own internal structures faster when it doesn't have to create a sync point after every call.

I micro-benchmarked RandomAccessFile as well to see how it would perform (results are very close to FileChannel, and still half the performance of AsynchronousFileChannel).

Not sure what happens with multi-threaded writes. This is on Java 7, on an SSD (the SSD is an order of magnitude faster than a magnetic drive, and faster by another order of magnitude on smaller files that fit in memory).

Will be interesting to see if the same ratios hold on Linux.

Ross Judson