I'm working on creating the MD5 Checksum for large video files. I'm currently using the code:

extension NSData {
    func MD5() -> NSString {
        let digestLength = Int(CC_MD5_DIGEST_LENGTH)
        let md5Buffer = UnsafeMutablePointer<CUnsignedChar>.allocate(capacity: digestLength)
        // Free the manually allocated digest buffer when the function returns.
        defer { md5Buffer.deallocate() }

        CC_MD5(bytes, CC_LONG(length), md5Buffer)
        let output = NSMutableString(capacity: Int(CC_MD5_DIGEST_LENGTH * 2))
        for i in 0..<digestLength {
            output.appendFormat("%02x", md5Buffer[i])
        }

        return NSString(format: output)
    }
}

But that requires loading the entire file into a memory buffer, which is not ideal for large video files. Is there a way in Swift to calculate the MD5 checksum by reading the file as a stream, so that the memory footprint stays minimal?

christopher.online
Mike Walker
  • Look into using the right combination of `CC_MD5_Init`, `CC_MD5_Update`, and `CC_MD5_Final`. – rmaddy Mar 21 '17 at 18:25

3 Answers

You can compute the MD5 checksum in chunks, as demonstrated, for example, in "Is there a MD5 library that doesn't require the whole input at the same time?".

Here is a possible implementation in Swift (now updated for Swift 5):

import Foundation
import CommonCrypto

func md5File(url: URL) -> Data? {

    let bufferSize = 1024 * 1024

    do {
        // Open file for reading:
        let file = try FileHandle(forReadingFrom: url)
        defer {
            file.closeFile()
        }

        // Create and initialize MD5 context:
        var context = CC_MD5_CTX()
        CC_MD5_Init(&context)

        // Read up to `bufferSize` bytes, until EOF is reached, and update MD5 context:
        while autoreleasepool(invoking: {
            let data = file.readData(ofLength: bufferSize)
            if data.count > 0 {
                data.withUnsafeBytes {
                    _ = CC_MD5_Update(&context, $0.baseAddress, numericCast(data.count))
                }
                return true // Continue
            } else {
                return false // End of file
            }
        }) { }

        // Compute the MD5 digest:
        var digest: [UInt8] = Array(repeating: 0, count: Int(CC_MD5_DIGEST_LENGTH))
        _ = CC_MD5_Final(&digest, &context)

        return Data(digest)

    } catch {
        print("Cannot open file:", error.localizedDescription)
        return nil
    }
}

The autorelease pool is needed to release the memory returned by `file.readData()`. Without it, the entire (potentially huge) file would be loaded into memory. Thanks to Abhi Beckert for noticing that and providing an implementation.

If you need the digest as a hex-encoded string, change the return type to `String?` and replace

return Data(digest)

by

let hexDigest = digest.map { String(format: "%02hhx", $0) }.joined()
return hexDigest
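
The hex conversion can also be tried in isolation. The following is a minimal sketch, assuming a helper property named `hexEncodedString` (a hypothetical name, not part of Foundation) and a hard-coded sample of four bytes rather than a real digest:

```swift
import Foundation

extension Data {
    /// Lowercase hex representation, two characters per byte.
    /// `hexEncodedString` is a hypothetical helper name used for illustration.
    var hexEncodedString: String {
        return map { String(format: "%02hhx", $0) }.joined()
    }
}

// Sample bytes only; in practice this would be the Data returned by md5File(url:).
let sampleDigest = Data([0xd4, 0x1d, 0x8c, 0xd9])
print(sampleDigest.hexEncodedString) // prints "d41d8cd9"
```

The `%02hhx` format pads each byte to two hex digits, so the resulting string is always twice as long as the digest.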
Martin R
  • For anyone using this code, you'll want to update to match the edit I just made as it was storing the entire file in the current autorelease pool, potentially consuming tens of gigabytes of memory. – Abhi Beckert Aug 26 '17 at 22:44
  • @AbhiBeckert: Indeed, that makes a huge difference. Thanks for the update! I have modified the code a bit to get rid of the additional exit variable, but that is purely a matter of personal choice. – Martin R Sep 10 '17 at 07:11
  • Excellent answer! Is there a simple way to get the progress while it's running? Say, to update the user in the UI. – Aaron Dec 15 '19 at 21:40
  • Thanks. But why is such an important `autoreleasepool` technique never mentioned in the documentation - https://developer.apple.com/documentation/foundation/filehandle/1413916-readdata ? Is it mentioned elsewhere? – Cheok Yan Cheng Aug 17 '21 at 18:06
  • @CheokYanCheng: `FileHandle` is the Swift name for the Objective-C `NSFileHandle` from the Foundation framework. Autorelease pools are only needed when interacting with Cocoa APIs. For Objective-C they are documented here: https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/MemoryMgmt/Articles/mmAutoreleasePools.html. – Martin R Aug 17 '21 at 21:47
  • If I hadn't come across your code snippet, I would have used `FileHandle` without an autorelease pool. This morning I ran some tests and profiling over a loop that performs file I/O and image scaling, and found that using `autoreleasepool` or not doesn't make much difference in the memory usage pattern. Do you know what the thought process should be for deciding whether to use `autoreleasepool`? - https://stackoverflow.com/questions/68826618/how-can-we-decide-whether-we-should-use-autoreleasepool – Cheok Yan Cheng Aug 18 '21 at 04:41
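
Regarding the progress question in the comments above: one possible approach is to compare the number of bytes read so far with the file size. This is only a sketch under assumptions. The function name `md5WithProgress` and its `progress` callback are hypothetical, it uses CryptoKit (iOS 13 / macOS 10.15 or later), and the callback runs on the calling thread, so UI updates would still need to be dispatched to the main queue.

```swift
import Foundation
import CryptoKit

/// Hypothetical helper: hashes the file in chunks and reports the fraction
/// of bytes processed (0.0...1.0) after each chunk.
func md5WithProgress(url: URL,
                     bufferSize: Int = 1024 * 1024,
                     progress: (Double) -> Void) throws -> Data {
    // Total size of the file, used as the denominator for the progress value.
    let totalBytes = try url.resourceValues(forKeys: [.fileSizeKey]).fileSize ?? 0

    let file = try FileHandle(forReadingFrom: url)
    defer { file.closeFile() }

    var md5 = Insecure.MD5()
    var bytesRead = 0

    // Same chunked loop as above; the pool releases each chunk after use.
    while autoreleasepool(invoking: {
        let data = file.readData(ofLength: bufferSize)
        guard !data.isEmpty else { return false } // End of file
        md5.update(data: data)
        bytesRead += data.count
        if totalBytes > 0 {
            progress(min(1.0, Double(bytesRead) / Double(totalBytes)))
        }
        return true // Continue
    }) { }

    return Data(md5.finalize())
}
```

A smaller `bufferSize` gives finer-grained progress updates at the cost of more read calls.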

Since iOS 13, the CommonCrypto MD5 functions are deprecated and produce a warning:

'CC_MD5_Init' was deprecated in iOS 13.0

You can replace the code with CryptoKit:

import Foundation
import CryptoKit

extension URL {

    func checksumInBase64() -> String? {
        let bufferSize = 16*1024

        do {
            // Open file for reading:
            let file = try FileHandle(forReadingFrom: self)
            defer {
                file.closeFile()
            }

            // Create and initialize MD5 context:
            var md5 = CryptoKit.Insecure.MD5()
            
            // Read up to `bufferSize` bytes, until EOF is reached, and update MD5 context:
            while autoreleasepool(invoking: {
                let data = file.readData(ofLength: bufferSize)
                if data.count > 0 {
                    md5.update(data: data)
                    return true // Continue
                } else {
                    return false // End of file
                }
            }) { }

            // Compute the MD5 digest:
            let data = Data(md5.finalize())
            
            return data.base64EncodedString()
        } catch {
            print("Cannot open file:", error.localizedDescription)
            
            return nil
        }
    }
}
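
As a quick sanity check of the incremental API, the sketch below feeds an in-memory message to `Insecure.MD5` in small slices and compares the result with the one-shot `Insecure.MD5.hash(data:)`. The 8-byte slice size and the sample message are arbitrary choices for illustration. It also shows hex output as an alternative to Base64.

```swift
import Foundation
import CryptoKit

let message = Data("The quick brown fox jumps over the lazy dog".utf8)

// One-shot hash over the whole buffer.
let oneShot = Data(Insecure.MD5.hash(data: message))

// Incremental hash: feed the same bytes in 8-byte slices.
var md5 = Insecure.MD5()
var offset = 0
while offset < message.count {
    let end = min(offset + 8, message.count)
    md5.update(data: message[offset..<end])
    offset = end
}
let incremental = Data(md5.finalize())

// Both approaches produce the same digest.
print(oneShot == incremental) // prints "true"

// Hex encoding instead of Base64.
let hexDigest = incremental.map { String(format: "%02x", $0) }.joined()
print(hexDigest) // prints "9e107d9d372bb6826bd81d3542a419d6"
```

Because the digest does not depend on how the input is split, the buffer size in the file-based version only affects memory use and I/O granularity, not the result.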
Cheok Yan Cheng

Solution (based on Martin R's answer) for SHA256 hash:

import Foundation
import CommonCrypto

func sha256(url: URL) -> Data? {
    do {
        let bufferSize = 1024 * 1024
        // Open file for reading:
        let file = try FileHandle(forReadingFrom: url)
        defer {
            file.closeFile()
        }

        // Create and initialize SHA256 context:
        var context = CC_SHA256_CTX()
        CC_SHA256_Init(&context)

        // Read up to `bufferSize` bytes, until EOF is reached, and update SHA256 context:
        while autoreleasepool(invoking: {
            // Read up to `bufferSize` bytes
            let data = file.readData(ofLength: bufferSize)
            if data.count > 0 {
                data.withUnsafeBytes {
                    // In Swift 5 the closure receives an UnsafeRawBufferPointer.
                    _ = CC_SHA256_Update(&context, $0.baseAddress, numericCast(data.count))
                }
                // Continue
                return true
            } else {
                // End of file
                return false
            }
        }) { }

        // Compute the SHA256 digest:
        var digest = [UInt8](repeating: 0, count: Int(CC_SHA256_DIGEST_LENGTH))
        _ = CC_SHA256_Final(&digest, &context)

        return Data(digest)
    } catch {
        print(error)
        return nil
    }
}

Usage, assuming a previously created `URL` instance named `fileURL`:

if let digestData = sha256(url: fileURL) {
    let calculatedHash = digestData.map { String(format: "%02hhx", $0) }.joined()
    DDLogDebug(calculatedHash)
}
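
On iOS 13 / macOS 10.15 or later, the same chunked loop can presumably be written with CryptoKit's `SHA256` instead of the CommonCrypto context. The following is a sketch under that assumption. The function name `sha256Hex` is hypothetical, and it returns the hex string directly rather than `Data`.

```swift
import Foundation
import CryptoKit

/// Hypothetical CryptoKit counterpart of sha256(url:), returning a hex string.
func sha256Hex(url: URL, bufferSize: Int = 1024 * 1024) -> String? {
    // Open file for reading; return nil if it cannot be opened.
    guard let file = try? FileHandle(forReadingFrom: url) else { return nil }
    defer { file.closeFile() }

    var hasher = SHA256()

    // Read up to `bufferSize` bytes at a time until EOF is reached.
    while autoreleasepool(invoking: {
        let data = file.readData(ofLength: bufferSize)
        guard !data.isEmpty else { return false } // End of file
        hasher.update(data: data)
        return true // Continue
    }) { }

    // The digest is a sequence of bytes; format each one as two hex digits.
    return hasher.finalize().map { String(format: "%02x", $0) }.joined()
}
```

Called as `sha256Hex(url: fileURL)`, it yields the same hex string as the `map`/`joined` conversion shown above.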
christopher.online