
I built a file-hashing method in Java that takes a String representation of a filepath + filename and then calculates the hash of that file. The hash can be any of the natively supported Java hashing algorithms, such as MD2 through SHA-512.
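
For context, a call looks roughly like this (the hasher instance and the path are just placeholders):

    // hypothetical usage; the instance, path, and algorithm name are placeholders
    String sha512 = hasher.getHash("/path/to/some/file.bin", "SHA-512");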

I am trying to eke out every last drop of performance, since this method is an integral part of a project I'm working on. I was advised to try using a FileChannel instead of a regular FileInputStream.

My original method:

    /**
     * Gets Hash of file.
     * 
     * @param file String path + filename of file to get hash.
     * @param hashAlgo Hash algorithm to use. <br/>
     *     Supported algorithms are: <br/>
     *     MD2, MD5 <br/>
     *     SHA-1 <br/>
     *     SHA-256, SHA-384, SHA-512
     * @return String value of hash. (Variable length dependent on hash algorithm used)
     * @throws IOException If file is invalid.
     * @throws HashTypeException If no supported or valid hash algorithm was found.
     */
    public String getHash(String file, String hashAlgo) throws IOException, HashTypeException {
        StringBuffer hexString = null;
        try {
            MessageDigest md = MessageDigest.getInstance(validateHashType(hashAlgo));
            FileInputStream fis = new FileInputStream(file);

            byte[] dataBytes = new byte[1024];

            int nread = 0;
            while ((nread = fis.read(dataBytes)) != -1) {
                md.update(dataBytes, 0, nread);
            }
            fis.close();
            byte[] mdbytes = md.digest();

            hexString = new StringBuffer();
            for (int i = 0; i < mdbytes.length; i++) {
                hexString.append(Integer.toHexString((0xFF & mdbytes[i])));
            }

            return hexString.toString();

        } catch (NoSuchAlgorithmException | HashTypeException e) {
            throw new HashTypeException("Unsuppored Hash Algorithm.", e);
        }
    }

Refactored method:

    /**
     * Gets Hash of file.
     * 
     * @param fileStr String path + filename of file to get hash.
     * @param hashAlgo Hash algorithm to use. <br/>
     *     Supported algorithms are: <br/>
     *     MD2, MD5 <br/>
     *     SHA-1 <br/>
     *     SHA-256, SHA-384, SHA-512
     * @return String value of hash. (Variable length dependent on hash algorithm used)
     * @throws IOException If file is invalid.
     * @throws HasherException If no supported or valid hash algorithm was found.
     */
    public String getHash(String fileStr, String hashAlgo) throws IOException, HasherException {

        File file = new File(fileStr);

        MessageDigest md = null;
        FileInputStream fis = null;
        FileChannel fc = null;
        ByteBuffer bbf = null;
        StringBuilder hexString = null;

        try {
            md = MessageDigest.getInstance(hashAlgo);
            fis = new FileInputStream(file);
            fc = fis.getChannel();
            bbf = ByteBuffer.allocate(1024); // allocation in bytes

            int bytes;

            while ((bytes = fc.read(bbf)) != -1) {
                md.update(bbf.array(), 0, bytes);
            }

            fc.close();
            fis.close();

            byte[] mdbytes = md.digest();

            hexString = new StringBuilder();

            for (int i = 0; i < mdbytes.length; i++) {
                hexString.append(Integer.toHexString((0xFF & mdbytes[i])));
            }

            return hexString.toString();

        } catch (NoSuchAlgorithmException e) {
            throw new HasherException("Unsupported Hash Algorithm.", e);
        }
    }

Both return a correct hash; however, the refactored method only seems to cooperate on small files. When I pass in a large file, it completely chokes out and I can't figure out why. I'm new to NIO, so please advise.

EDIT: Forgot to mention I'm throwing SHA-512's through it for testing.

UPDATE: Updating with my current method.

    /**
     * Gets Hash of file.
     * 
     * @param fileStr String path + filename of file to get hash.
     * @param hashAlgo Hash algorithm to use. <br/>
     *     Supported algorithms are: <br/>
     *     MD2, MD5 <br/>
     *     SHA-1 <br/>
     *     SHA-256, SHA-384, SHA-512
     * @return String value of hash. (Variable length dependent on hash algorithm used)
     * @throws IOException If file is invalid.
     * @throws HasherException If no supported or valid hash algorithm was found.
     */
    public String getHash(String fileStr, String hashAlgo) throws IOException, HasherException {

        File file = new File(fileStr);

        MessageDigest md = null;
        FileInputStream fis = null;
        FileChannel fc = null;
        ByteBuffer bbf = null;
        StringBuilder hexString = null;

        try {
            md = MessageDigest.getInstance(hashAlgo);
            fis = new FileInputStream(file);
            fc = fis.getChannel();
            bbf = ByteBuffer.allocateDirect(8192); // allocation in bytes - 1024, 2048, 4096, 8192

            int b;

            b = fc.read(bbf);

            while ((b != -1) && (b != 0)) {
                bbf.flip();

                byte[] bytes = new byte[b];
                bbf.get(bytes);

                md.update(bytes, 0, b);

                bbf.clear();
                b = fc.read(bbf);
            }

            fis.close();

            byte[] mdbytes = md.digest();

            hexString = new StringBuilder();

            for (int i = 0; i < mdbytes.length; i++) {
                hexString.append(Integer.toHexString((0xFF & mdbytes[i])));
            }

            return hexString.toString();

        } catch (NoSuchAlgorithmException e) {
            throw new HasherException("Unsupported Hash Algorithm.", e);
        }
    }

So I attempted a benchmark, hashing the MD5 of a 2.92GB file using my original example and my latest update's example. Of course any benchmark is relative, since OS and disk caching and other "magic" will skew repeated reads of the same file... but here's a shot at some benchmarks. I loaded each method up and fired it off 5 times after compiling it fresh. The benchmark was taken from the last (5th) run, as this would be the "hottest" run for that method and any caching "magic" (in my theory, anyway).

Here's the benchmarks so far: 

    Original Method - 14.987909 (s) 
    Latest Method - 11.236802 (s)

That is a 25.03% decrease in time taken to hash the same 2.92GB file. Pretty good.

SnakeDoc
    Why not use [MessageDigest.update(ByteBuffer)](http://docs.oracle.com/javase/6/docs/api/java/security/MessageDigest.html#update%28java.nio.ByteBuffer%29) method that takes `ByteBuffer` directly instead of using the backing array? – prunge Apr 17 '13 at 05:37
  • Just wanted to add for future visitors -- if you switch to using `ByteBuffer.allocateDirect()` then there is no backing array and `ByteBuffer.array()` will `fail`. Instead switch over to using `MessageDigest.update(ByteBuffer)` per @prunge advice. This is not only more efficient, but cleaner than trying to read the buffer to some array then pass that array into `MessageDigest.update()`. – SnakeDoc Apr 17 '13 at 14:50
    @SnakeDoc I found a bug in your stringification code for the hash numbers. Also presented a memory mapped implementation, which could be adapted to larger files by mmapping multiple times over the file in increments smaller than 2GB. – BillRobertson42 Apr 01 '14 at 21:47
  • I believe this is incorrect: Integer.toHexString((0xFF & mdbytes[i])) leading zeros will be dropped for 0x00-0x0F – swpalmer Oct 18 '16 at 13:54
    The code does not produce a valid hash code. Indeed leading zeros are dropped. `String.format("%02x", mdbytes[i])` should be used instead of `Integer.toHexString(0xFF & mdbytes[i])` to avoid this. – user3224237 May 15 '17 at 15:11
  • @prunge and @SnakeDoc be aware that `MessageDigest.update(ByteBuffer)` simply fetches the buffer backing `array()` and then passes that array to its own `update()`. And if the ByteBuffer doesn't _have_ an array, it creates a temporary array of its own then repeatedly calls `buffer.get(temp, 0, tempsize)` and `update(temp, 0, tempsize)`. So there's no magical benefit when calculating digests to pass the `ByteBuffer`; it's cleaner to read certainly, but it's not inherently faster or avoiding bulk copying. – Ti Strga Dec 15 '22 at 20:11
  • @TiStrga that may be the case for the default implementation in engineUpdate() (`MessageDigest.update(ByteBuffer)` -> `engineUpdate(ByteBuffer)`) but subclasses can and do override engineUpdate() and have a more efficient implementation that uses the buffer directly, for example in `P11Digest.engineUpdate()`. – prunge Dec 15 '22 at 23:28
  • Oh sure, it's possible to use the buffer directly (and even `array()` is just a direct pointer, not a copy, so in the base case it's still okay), but it's not going to be a huge impact for most files. The whole point, after all, is to get the data _into memory_ to perform the hashing on it, so the "fast I/O copying" aspect of memory-mapped files only helps special hardware that doesn't need an in-memory representation, like the PKCS#11 you mentioned. – Ti Strga Dec 16 '22 at 16:18

3 Answers


3 suggestions:

1) clear the buffer after each read. Once the buffer is full, fc.read(bbf) returns 0 rather than -1, so the original loop never terminates:

while ((bytes = fc.read(bbf)) != -1) {
    md.update(bbf.array(), 0, bytes);
    bbf.clear();
}

2) do not close both fc and fis; it's redundant, closing fis is enough. The FileInputStream.close API says:

If this stream has an associated channel then the channel is closed as well.

3) if you want a performance improvement with FileChannel, use

ByteBuffer.allocateDirect(1024); 
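
For illustration, a minimal sketch combining the three suggestions — and, as discussed in the comments here and on the question, passing the buffer straight to MessageDigest.update(ByteBuffer), since a direct buffer has no accessible backing array — might look like this; file and hashAlgo are the question's variables, and exception handling is omitted:

    MessageDigest md = MessageDigest.getInstance(hashAlgo);
    try (FileInputStream fis = new FileInputStream(file)) { // closing fis also closes its channel (2)
        FileChannel fc = fis.getChannel();
        ByteBuffer bbf = ByteBuffer.allocateDirect(8192);    // direct buffer (3)

        while (fc.read(bbf) != -1) {  // -1 signals end-of-file
            bbf.flip();               // switch the buffer from filling to draining
            md.update(bbf);           // digests the remaining bytes; position advances to the limit
            bbf.clear();              // reset for the next read (1)
        }
    }
    byte[] mdbytes = md.digest();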
Evgeniy Dorofeev
    If using `ByteBuffer.allocateDirect(1024);` then `ByteBuffer.array()` calls will fail with `UnsupportedOperationException`. – prunge Apr 17 '13 at 05:35
  • @prunge I ran into this trouble a little bit... I think I solved that bit up in my latest update... how's it looking now? – SnakeDoc Apr 17 '13 at 05:43
    Yes, but you can use ByteBuffer.get(byte[]) – Evgeniy Dorofeev Apr 17 '13 at 05:43
  • @EvgeniyDorofeev just added to my latest update -- i had a brain dead moment and didn't realize I already used get(byte[]) – SnakeDoc Apr 17 '13 at 05:45
    (4) Use a much bigger buffer. 1024 is far too small and has been since about 1996. Use at least 4096, preferably more like 64k. – user207421 Apr 17 '13 at 08:25
  • @EJP I now have it set to 8192 bytes (8k) ... I had read some stuff on another old SO thread that suggested using either 4k or 8k since that was likely the native block size, and anything larger would not have any real effect once the buffer overhead is no longer an issue (i.e. setting a buffer size of 1GB would yield no performance increase, and might actually decrease performance in some situations over 128k or 12k or 8k... etc). Can you point me in a direction with some info about properly selecting buffer size? We can dynamically choose the buffer size depending on file size, etc. if needed. – SnakeDoc Apr 17 '13 at 14:46
    I think you've already found what you need. I don't have any problem with 8192 as a buffer size, and I agree that it can be too large as well as too small. Probably the ideal size is the disk cluster size, but you can't get that from Java as far as I know. I don't see any reason why using a direct buffer is a good idea here. It's only beneficial if you are just copying data and don't look at it in the Java code. Using get() negates all the advantages of a direct buffer. – user207421 Oct 06 '13 at 21:56

Another possible improvement might come if the code only allocated the temp buffer once.

e.g.

        int bufsize = 8192;
        ByteBuffer buffer = ByteBuffer.allocateDirect(bufsize); 
        byte[] temp = new byte[bufsize];
        int b = channel.read(buffer);

        while (b > 0) {
            buffer.flip();

            buffer.get(temp, 0, b);
            md.update(temp, 0, b);
            buffer.clear();

            b = channel.read(buffer);
        }

Addendum

Note: There is a bug in the string-building code. Any byte below 0x10 ends up as a single hex digit, so leading zeros are dropped. This can easily be fixed, e.g.

    hexString.append(String.format("%02x", mdbytes[i]));

Also, as an experiment, I rewrote the code to use mapped byte buffers. It runs about 30% faster (6-7 millis vs. 9-11 millis, FWIW). I expect you could get more out of it if you wrote hashing code that operated directly on the byte buffer.

I attempted to account for JVM initialization and file system caching by hashing a different file with each algorithm before starting the timer. The first run through the code is about 25 times slower than a normal run. This appears to be due to JVM initialization, because all runs in the timing loop are roughly the same length. They do not appear to benefit from caching. I tested with the MD5 algorithm. Also, during the timing portion, only one algorithm is run for the duration of the test program.

The code in the loop is shorter, so potentially more understandable. I'm not 100% certain what kind of pressure memory-mapping many files under high volume would exert on the JVM, so that is something you would need to research and consider if you wanted to run this sort of solution under load.

public static byte[] hash(File file, String hashAlgo) throws IOException {

    FileInputStream inputStream = null;

    try {
        MessageDigest md = MessageDigest.getInstance(hashAlgo);
        inputStream = new FileInputStream(file);
        FileChannel channel = inputStream.getChannel();

        long length = file.length();
        if(length > Integer.MAX_VALUE) {
            // you could make this work with some care,
            // but this code does not bother.
            throw new IOException("File "+file.getAbsolutePath()+" is too large.");
        }

        ByteBuffer buffer = channel.map(MapMode.READ_ONLY, 0, length);

        int bufsize = 1024 * 8;          
        byte[] temp = new byte[bufsize];
        int bytesRead = 0;

        while (bytesRead < length) {
            int numBytes = (int)length - bytesRead >= bufsize ? 
                                         bufsize : 
                                         (int)length - bytesRead;
            buffer.get(temp, 0, numBytes);
            md.update(temp, 0, numBytes);
            bytesRead += numBytes;
        }

        byte[] mdbytes = md.digest();
        return mdbytes;

    } catch (NoSuchAlgorithmException e) {
        throw new IllegalArgumentException("Unsupported Hash Algorithm.", e);
    }
    finally {
        if(inputStream != null) {
            inputStream.close();
        }
    }
}
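
A hypothetical caller might look like this, using the String.format("%02x", ...) fix from the comments for the hex encoding (the path and algorithm are illustrative):

    byte[] digest = hash(new File("/path/to/file.bin"), "SHA-256");

    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
        hex.append(String.format("%02x", b)); // pads single-digit values with a leading zero
    }
    System.out.println(hex);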
BillRobertson42

Here is an example of file hashing using NIO:

  • Path
  • FileChannel
  • MappedByteBuffer

It avoids the use of byte[], so I think this should be an improved version of the above. There is also a second NIO example where the hashed value is stored in the file's user-defined attributes; that can be used for HTML ETag generation and other cases where the file does not change.

    public static final byte[] getFileHash(final File src, final String hashAlgo) throws IOException, NoSuchAlgorithmException {
        final int BUFFER = 32 * 1024;
        final Path file = src.toPath();
        try(final FileChannel fc = FileChannel.open(file)) {
            final long size = fc.size();
            final MessageDigest hash = MessageDigest.getInstance(hashAlgo);
            long position = 0;
            while(position < size) {
                // map the next chunk starting at the current position
                final MappedByteBuffer data = fc.map(FileChannel.MapMode.READ_ONLY, position, Math.min(size - position, BUFFER));
                if(!data.isLoaded()) data.load();
                hash.update(data);
                position += data.limit();
            }
            return hash.digest();
        }
    }

    public static final byte[] getCachedFileHash(final File src, final String hashAlgo) throws NoSuchAlgorithmException, FileNotFoundException, IOException {
        final Path path = src.toPath();
        if(!Files.isReadable(path)) return null;
        final UserDefinedFileAttributeView view = Files.getFileAttributeView(path, UserDefinedFileAttributeView.class);
        final String name = "user.hash."+hashAlgo;
        final ByteBuffer bb = ByteBuffer.allocate(64);
        try {
            view.read(name, bb);
            bb.flip();
            final byte[] cached = new byte[bb.remaining()]; // return only the bytes actually stored
            bb.get(cached);
            return cached;
        } catch(final NoSuchFileException t) { // Not yet calculated
        } catch(final Throwable t) { t.printStackTrace(); }
        System.out.println("Hash not found, calculating");
        final byte[] hash = getFileHash(src, hashAlgo);
        view.write(name, ByteBuffer.wrap(hash));
        return hash;
    }
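
As a sketch of the ETag use case mentioned above — the path, algorithm, and how the header is actually sent are illustrative assumptions, not part of the answer's code:

    byte[] hash = getCachedFileHash(new File("/var/www/logo.png"), "SHA-256");

    StringBuilder etag = new StringBuilder("\"");
    for (byte b : hash) {
        etag.append(String.format("%02x", b)); // hex-encode the digest
    }
    etag.append('"');
    // send as:  ETag: "<hex digest>"  and compare against If-None-Match on later requests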
SkateScout
    You should map the entire file once, instead of all those bits. You don't seem to be aware of the memory/collection problems associated with using lots of mapped buffers. – user207421 Oct 06 '13 at 21:57
  • @EJP for completeness, can you elaborate or link on why it may be bad in this scenario? – SnakeDoc Oct 07 '13 at 17:54
  • 1) Is there no problem with mapping a large file into memory all at once? 2) Is there any way to "move" the mapped window? – SkateScout Oct 08 '13 at 20:53