
I've spent considerable time attempting to optimize a file-hashing algorithm to eke out every last drop of performance possible.

See my previous SO threads:

Get File Hash Performance/Optimization

FileChannel ByteBuffer and Hashing Files

Determining Appropriate Buffer Size

It was recommended several times to use Java NIO to gain native performance increases (by keeping the buffers in the system instead of bringing them into the JVM). However, my NIO code runs considerably slower in benchmarks (hashing the same files over and over with each algorithm to negate any OS/drive "magic" that could be skewing results).

I now have two methods that do the same thing:

This one runs faster almost every time:

/**
 * Gets Hash of file.
 * 
 * @param file String path + filename of file to get hash.
 * @param hashAlgo Hash algorithm to use. <br/>
 *     Supported algorithms are: <br/>
 *     MD2, MD5 <br/>
 *     SHA-1 <br/>
 *     SHA-256, SHA-384, SHA-512
 * @param BUFFER Buffer size in bytes. Recommended to stay in<br/>
 *          powers of 2 such as 1024, 2048, <br/>
 *          4096, 8192, 16384, 32768, 65536, etc.
 * @return String value of hash. (Variable length dependent on hash algorithm used)
 * @throws IOException If file is invalid.
 * @throws HasherException If no supported or valid hash algorithm was found.
 */
public String getHash(String file, String hashAlgo, int BUFFER) throws IOException, HasherException {
    try {
        MessageDigest md = MessageDigest.getInstance(validateHashType(hashAlgo));
        // try-with-resources ensures the stream is closed even if an exception is thrown
        try (FileInputStream fis = new FileInputStream(file)) {
            byte[] dataBytes = new byte[BUFFER];
            int nread;
            while ((nread = fis.read(dataBytes)) != -1) {
                md.update(dataBytes, 0, nread);
            }
        }
        byte[] mdbytes = md.digest();

        StringBuilder hexString = new StringBuilder();
        for (int i = 0; i < mdbytes.length; i++) {
            // %02x keeps the leading zero that Integer.toHexString would drop
            hexString.append(String.format("%02x", mdbytes[i]));
        }

        return hexString.toString();

    } catch (NoSuchAlgorithmException e) {
        throw new HasherException("Unsupported Hash Algorithm.", e);
    }
}

My Java NIO method that runs considerably slower most of the time:

/**
 * Gets Hash of file using java.nio File Channels and ByteBuffer 
 * <br/>for native system calls where possible. This may improve <br/>
 * performance in some circumstances.
 * 
 * @param fileStr String path + filename of file to get hash.
 * @param hashAlgo Hash algorithm to use. <br/>
 *     Supported algorithms are: <br/>
 *     MD2, MD5 <br/>
 *     SHA-1 <br/>
 *     SHA-256, SHA-384, SHA-512
 * @param BUFFER Buffer size in bytes. Recommended to stay in<br/>
 *          powers of 2 such as 1024, 2048, <br/>
 *          4096, 8192, 16384, 32768, 65536, etc.
 * @return String value of hash. (Variable length dependent on hash algorithm used)
 * @throws IOException If file is invalid.
 * @throws HasherException If no supported or valid hash algorithm was found.
 */
public String getHashNIO(String fileStr, String hashAlgo, int BUFFER) throws IOException, HasherException {

    File file = new File(fileStr);

    try {
        MessageDigest md = MessageDigest.getInstance(validateHashType(hashAlgo));

        try (FileInputStream fis = new FileInputStream(file);
             FileChannel fc = fis.getChannel()) {

            ByteBuffer bbf = ByteBuffer.allocateDirect(BUFFER); // allocation in bytes - 1024, 2048, 4096, 8192

            int b = fc.read(bbf);
            while ((b != -1) && (b != 0)) {
                bbf.flip();

                // copy out of the direct buffer into a heap array for the digest
                byte[] bytes = new byte[b];
                bbf.get(bytes);
                md.update(bytes, 0, b);

                bbf.clear();
                b = fc.read(bbf);
            }
        }

        byte[] mdbytes = md.digest();

        StringBuilder hexString = new StringBuilder();
        for (int i = 0; i < mdbytes.length; i++) {
            // %02x keeps the leading zero that Integer.toHexString would drop
            hexString.append(String.format("%02x", mdbytes[i]));
        }

        return hexString.toString();

    } catch (NoSuchAlgorithmException e) {
        throw new HasherException("Unsupported Hash Algorithm.", e);
    }
}
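
For context, a minimal sketch of the kind of benchmark loop described above; the file path, run count, and the enclosing Hasher class are hypothetical placeholders:

public static void main(String[] args) throws IOException, HasherException {
    // Hash the same file repeatedly with each method so OS/drive caching
    // affects both methods equally.
    Hasher hasher = new Hasher();           // hypothetical class holding both methods
    String file = "/path/to/test/file";     // placeholder
    int runs = 50;

    long start = System.nanoTime();
    for (int i = 0; i < runs; i++) {
        hasher.getHash(file, "SHA-256", 8192);
    }
    long oldIoNanos = System.nanoTime() - start;

    start = System.nanoTime();
    for (int i = 0; i < runs; i++) {
        hasher.getHashNIO(file, "SHA-256", 8192);
    }
    long nioNanos = System.nanoTime() - start;

    System.out.printf("old I/O: %d ms, NIO: %d ms%n",
            oldIoNanos / 1_000_000, nioNanos / 1_000_000);
}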

My thoughts are that Java NIO attempts to use native system calls to keep processing and storage (buffers) in the system and out of the JVM, which in theory prevents the program from constantly shuffling data back and forth between the JVM and the system. In theory this should be faster... but perhaps my MessageDigest forces the JVM to bring the buffer contents in anyway, negating any performance improvement the native buffers/system calls can bring? Am I correct in this logic, or am I way off?
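
One way to test that hypothesis is to hand the buffer to the digest directly: MessageDigest has an update(ByteBuffer) overload, so the intermediate byte[] copy in the loop can be dropped. A minimal sketch of the modified loop, using the same variables as in getHashNIO above:

// feed the direct buffer straight to the digest; no intermediate byte[] copy
int b = fc.read(bbf);
while ((b != -1) && (b != 0)) {
    bbf.flip();
    md.update(bbf);   // the digest consumes the ByteBuffer directly
    bbf.clear();
    b = fc.read(bbf);
}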

Please help me understand why Java NIO is not better in this scenario.

  • NIO excels with concurrency. If you don't have concurrent execution, it buys you only overhead in both code and processing. I don't know what's causing the performance difference, but it might be that. – akostadinov May 01 '13 at 15:39
  • Reading from the channel into a `ByteBuffer` and then copying again into a `byte[]` may hurt the performance of the NIO method. NIO magic (besides the non-blocking part) mainly kicks in if the JVM can avoid transporting data from system space (OS, disk, ...) into user space. As the hashing algorithm needs to read each byte of the file, there is obviously no shortcut available. If the performance of the old-IO method is not sufficient, consider using a profiler or test library implementations (e.g. [guava](http://code.google.com/p/guava-libraries/wiki/HashingExplained); see the sketch after these comments) for better performance. – Pyranja May 01 '13 at 15:58
  • @Pyranja hmm, interesting info. I checked out the Guava library; unfortunately I don't foresee it yielding any performance increase over my methods above, as they all rely on the default `java.security.MessageDigest` implementation for actually performing the hash... and if a file is too large to fit reasonably in a buffer (ex. a 10GB file), then it must be streamed through a buffer, and we've come full circle to doing many I/O ops until we've hashed all bits... hmm. – SnakeDoc May 01 '13 at 16:29
  • @akostadinov Does that hold true for all of NIO? There's more to it than nonblocking channels. – millimoose May 01 '13 at 18:45
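
For reference, the Guava approach Pyranja mentions wraps this whole pattern in a single call (a sketch against the Guava-14-era API; internally it also delegates to `MessageDigest`):

import com.google.common.hash.Hashing;
import com.google.common.io.Files;
import java.io.File;
import java.io.IOException;

// Sketch of the Guava approach from the comment above.
// Files.hash streams the file through the hash function internally.
public class GuavaHashExample {
    public static String sha256Hex(String path) throws IOException {
        return Files.hash(new File(path), Hashing.sha256()).toString(); // lowercase, zero-padded hex
    }
}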

1 Answer


Two things which would probably make your NIO approach better:

  1. Try using a memory-mapped file instead of reading the data into heap memory.
  2. Pass the data to the digest using a ByteBuffer instead of a byte[] array.

The first should avoid copying the data between the file cache and the application heap, while the second should avoid copying the data between the buffer and a byte array. Without these optimizations, you'll likely have more copying than a naïve non-NIO approach has.
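
A minimal sketch combining both suggestions (error handling omitted; files larger than 2 GB would need to be mapped in chunks, since a single `MappedByteBuffer` is limited to `Integer.MAX_VALUE` bytes):

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class MappedHashExample {
    public static String hash(String path, String algo)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algo);
        try (FileInputStream fis = new FileInputStream(path);
             FileChannel fc = fis.getChannel()) {
            // map the file into memory instead of reading it into the heap...
            MappedByteBuffer mbb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
            // ...and hand the buffer to the digest directly, with no byte[] copy
            md.update(mbb);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b)); // zero-padded hex
        }
        return sb.toString();
    }
}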
