6

Using the following code as a benchmark, the system can write 10,000 records to disk in a fraction of a second:

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

void withSync() {
    /* O_CREAT requires a mode argument */
    int f = open("/tmp/t8", O_RDWR | O_CREAT, 0644);
    lseek(f, 0, SEEK_SET);
    int records = 10 * 1000;
    clock_t ustart = clock();
    for (int i = 0; i < records; i++) {
        write(f, "012345678901234567890123456789", 30);
        fsync(f);   /* flush this record to the device before the next write */
    }
    clock_t uend = clock();
    close(f);
    double seconds = ((double)(uend - ustart)) / CLOCKS_PER_SEC;
    printf("   sync() seconds:%lf   writes per second:%lf\n",
           seconds, records / seconds);
}

With the above code, 10,000 records can be written and flushed out to disk in a fraction of a second; the output is below:

sync() seconds:0.006268   writes per second:1595405.23

In the Java version, it takes over 4 seconds to write 10,000 records. Is this just a limitation of Java, or am I missing something?

public void testFileChannel() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(new File("/tmp/t5"),"rw");
    FileChannel c = raf.getChannel();
    c.force(true);
    ByteBuffer b = ByteBuffer.allocateDirect(64*1024);
    long s = System.currentTimeMillis();
    for (int i = 0; i < 10000; i++) {
        b.clear();
        b.put("012345678901234567890123456789".getBytes());
        b.flip();
        c.write(b);
        c.force(false);
    }
    long e=System.currentTimeMillis();
    raf.close();
    System.out.println("With flush "+(e-s));

}

Returns this:

With flush 4263

Please help me understand the correct/fastest way to write records to disk in Java.

Note: I am using the RandomAccessFile class in combination with a ByteBuffer as ultimately we need random read/write access on this file.

Jay
  • Your comparison isn't fair. You are using a ByteBuffer and calling .getBytes() in the Java version. If your idea is to test performance for your application then this is okay. But to compare to C this is unfair as you are doing different things. – dave Nov 09 '12 at 07:07
  • It's more than fair. Using a ByteBuffer and .getBytes is actually faster (in my tests, on my machine at least) than doing it in Java any other way. If you have other suggestions on how to do random access in Java, I am very open to hearing them. Thanks! – Jay Nov 09 '12 at 11:40

4 Answers

5

Actually, I am surprised that test is not slower. The behavior of force is OS dependent, but broadly it forces the data to disk. If you have an SSD you might achieve 40K writes per second, but with an HDD you won't. The C example clearly isn't committing the data to disk, as even the fastest SSD cannot perform more than about 235K IOPS (that is the figure the manufacturers guarantee it won't go faster than :D).

If you need the data committed to disk on every write, you can expect it to be slow and entirely dependent on the speed of your hardware. If you only need the data flushed to the OS, so that no data is lost if the program crashes but the OS does not, you can write without force. A faster option is to use memory-mapped files. This gives you random access without a system call for each record.

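For illustration, here is a minimal memory-mapped sketch along those lines (the file name /tmp/t9, the class name, and the single region mapped for all 10,000 records are assumptions for the example, not code from the question); each put lands in the page cache with no system call, and MappedByteBuffer.force() can still be called when durability is needed:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedWriteSketch {
    public static void main(String[] args) throws Exception {
        int records = 10000;
        byte[] record = "012345678901234567890123456789".getBytes();
        RandomAccessFile raf = new RandomAccessFile("/tmp/t9", "rw");
        FileChannel ch = raf.getChannel();
        // Map one region big enough for every record; writes go to the page cache,
        // so there is no system call per record.
        MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0,
                                      (long) records * record.length);
        long start = System.nanoTime();
        for (int i = 0; i < records; i++) {
            map.put(record);   // relative put; put(index, value) would give random access
        }
        map.force();           // ask the OS once, at the end, to flush the mapped region to disk
        long elapsed = System.nanoTime() - start;
        raf.close();
        System.out.println("mapped write ms: " + elapsed / 1000000.0);
    }
}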
I have a library, Java Chronicle, which can read/write 5-20 million records per second with a latency of 80 ns, in text or binary formats, with random access, and which can be shared between processes. It only works this fast because it is not committing the data to disk on every record, but you can test that if the JVM crashes at any point, no data written to the chronicle is lost.

Peter Lawrey
  • Good question as to whether C really is flushing the data to disk! (I'm not sure of a way to test that.) I do know that the C code runs significantly faster if you stop flushing, so the C code is doing something different with the flush; I just don't know what exactly. – Jay Nov 09 '12 at 11:28
  • I would expect that flush is pushing the buffer out to the OS. Without the flush it might be buffering a few lines at a time. – Peter Lawrey Nov 09 '12 at 11:31
  • Your suggestions make a lot of sense! The geek in me wants to find a way to confirm this for sure... Maybe some tests that involve the power cable being disconnected (: – Jay Nov 09 '12 at 11:38
  • Try polling the file size and see in what multiples it grows. – Peter Lawrey Nov 09 '12 at 11:44
  • Excellent suggestion, I'll test that to see what it reveals. I've been reading up on IOPS. I wonder if the IOPS rating is higher if you are reading/writing to the same "area" of the disk. – Jay Nov 09 '12 at 11:59
  • Talk to a filesystem kernel developer; they do things like pull power cables out of machines all the time. It's the only way to simulate a power loss. – dave Nov 09 '12 at 14:05
  • If the data must be committed to disk then you want the sync system call: http://linux.die.net/man/2/sync – dave Nov 09 '12 at 14:07
  • I have updated the above question so that the C code uses the lower level `open()` and `fsync(fd)`. As it turns out, the impact on performance is negligible. I'm not sure what this means; perhaps `fsync()` is ignored in C on OS/X? – Jay Nov 12 '12 at 01:18
  • From the OS/X fsync(2) [man page](https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/fsync.2.html): "For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect. Please see fcntl(2) for more detail." – David Moles Sep 17 '13 at 18:06
1

This code is more similar to what you wrote in C. Takes only 5 msec on my machine. If you really need to flush after every write, it takes about 60 msec. Your original code took about 11 seconds on this machine. BTW, closing the output stream also flushes.

public static void testFileOutputStream() throws IOException {
  OutputStream os = new BufferedOutputStream( new FileOutputStream( "/tmp/fos" ) );
  byte[] bytes = "012345678901234567890123456789".getBytes();
  long s = System.nanoTime();
  for ( int i = 0; i < 10000; i++ ) {
    os.write( bytes );
  }
  long e = System.nanoTime();
  os.close();
  System.out.println( "outputstream " + ( e - s ) / 1e6 );
}
jackrabbit
  • Turning off flushing makes the code above execute in about 0.15 seconds on my machine (: The software we write needs to be able to guarantee that when it says data was saved, it really was saved. – Jay Nov 09 '12 at 11:23
  • So, with flushing, it's still only 60 msec... BTW, `fflush` does not actually write out to disk. What is your time with `fsync` for the C version? `fsync` is similar to `os.getFD().sync()` when you remove the `BufferedOutputStream` decoration (a sketch of that variant follows after these comments). Syncing is really slow though: the test then takes 6 seconds here. – jackrabbit Nov 10 '12 at 11:42
  • Either way, this method doesn't support random file access. Using fsync doesn't slow down the C code significantly. – Jay Nov 12 '12 at 21:46
  • @Jacob From the manpage of fsync: `Note that while fsync() will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence.` You need to call fcntl with F_FULLFSYNC to be sure. – jackrabbit Nov 25 '12 at 20:37
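To make the getFD().sync() point from the comments concrete, here is a minimal sketch of that variant (the file name /tmp/t6 and the wrapper class are assumptions for the example); it is essentially the Java counterpart of the fsync() loop in the question, so it can be expected to be slow for the same hardware reasons:

import java.io.FileDescriptor;
import java.io.FileOutputStream;

public class SyncedWriteSketch {
    public static void main(String[] args) throws Exception {
        byte[] record = "012345678901234567890123456789".getBytes();
        FileOutputStream fos = new FileOutputStream("/tmp/t6");
        FileDescriptor fd = fos.getFD();
        long start = System.nanoTime();
        for (int i = 0; i < 10000; i++) {
            fos.write(record);
            fd.sync();   // like fsync(fd): block until the OS reports the data has been pushed to the device
        }
        long elapsed = System.nanoTime() - start;
        fos.close();
        System.out.println("synced write ms: " + elapsed / 1000000.0);
    }
}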
0

The Java equivalent of fputs is file.write("012345678901234567890123456789");. You are calling 4 functions in Java and just 1 in C, so the delay seems obvious.

David Ranieri
  • That's not the reason that it's *5* orders of magnitude slower. There's something else that is causing a massive slowdown. – dave Nov 09 '12 at 06:34
  • I appreciate your effort to reply; however, my tests indicate that using `write()` then `flush()` or other `DirectFileAccess` methods is marginally slower. Either way, we are talking about something that is disk-bound, not CPU-bound. I can find no Java code that is faster than this. – Jay Nov 09 '12 at 06:44
  • @dave: Virtual Machine vs compiled ;) – David Ranieri Nov 09 '12 at 06:49
  • @DavidRF: Well, if that's your view, put it in your answer. Though I still don't think your average JVM is the reason for a five-orders-of-magnitude slowdown. Java programs are 10,000x slower than C? Quick, tell the world not to write another line of Java!!! Either that or you are wrong. I'm going with the latter (because of Occam's razor). – dave Nov 09 '12 at 06:52
  • But really, 4 seconds vs 0.001 seconds to write 10,000 records to a random access file? That's 4,000 times slower! Is Java really that bad? – Jay Nov 09 '12 at 06:53
  • The reason for my question is that I want to be proven wrong. I am hoping it's possible to use Java to write a class of applications that depend heavily on fast disk performance. I am just a little shocked that the performance difference is that huge. – Jay Nov 09 '12 at 06:56
0

I think this is most similar to your C version. I think the direct buffers in your Java example are causing many more buffer copies than the C version. This takes about 2.2s on my (old) box.

  public static void testFileChannelSimple() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(new File("/tmp/t5"),"rw");
    FileChannel c = raf.getChannel();
    c.force(true);
    byte[] bytes = "012345678901234567890123456789".getBytes();
    long s = System.currentTimeMillis();
    for(int i=0;i<10000;i++){
      raf.write(bytes);
      c.force(true);
    }
    long e=System.currentTimeMillis();
    raf.close();
    System.out.println("With flush "+(e-s));
  }
jtahlborn