11

I have a number of files that will live on a server. Users have the ability to create these kinds of files (plists) on-device which will then upload to said server (CloudKit). I would like to unique them by content (the uniquing methodology should be resilient to variations in creation date). My understanding is that I should hash these files in order to obtain unique file names for them. My questions are:

  1. Is my understanding correct that what I want is a hash function?
  2. Which function should I use (from CommonCrypto)?
  3. Is what I need a digest?
  4. How would I go about it in code? (I assume this should be hashed over an NSData instance?). My understanding from googling around is that I need a bridging header include but beyond that the use of CommonCrypto baffles me. If there is a simpler way using first-party APIs (Apple) I am all ears (I want to avoid using third party code as much as possible).

Thanks so much!

iOS Gamer
  • Various hashing methods here: http://stackoverflow.com/questions/25388747/sha256-in-swift. – Martin R Mar 21 '17 at 17:27
  • Big warning. Hash functions don't generate unique identifiers. Collisions are possible and you should deal with them. – Sulthan Mar 21 '17 at 18:00
  • @Sulthan While that is true cryptographic hashes are safely used to identify files, see Git. – zaph Mar 21 '17 at 18:19
  • @zaph Very true. The possibility of a collision has been discussed several times. Note, for example, the recent case where two different PDF files with the same SHA1 broke several SVN repositories. I know, that was SHA1, and SHA256 has a much lower probability of such things happening, but it can still happen. And if the system is critical, the consequences can be critical. Also, solving the conflict is usually trivial. As for performance, in most situations it is not feasible to compute the hash of the entire file (especially big binary files) or to compute a complicated cryptographic hash. – Sulthan Mar 21 '17 at 18:29
  • Indeed, if hashes are different: done, if hashes match: compare the files to the first compare failure or end. – zaph Mar 21 '17 at 18:34
  • Thank you for the discussion everyone! Indeed I will be wary of collisions going forward. @MartinR I went with your code, I much prefer avoiding optionals if possible. Thanks! – iOS Gamer Mar 21 '17 at 19:52
  • My error on the optionals, when I removed the `guard` I should have removed the optional return, corrected. I like Martin's code too. – zaph Mar 21 '17 at 21:10
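
The collision handling discussed in the comments above (different hashes: done; matching hashes: compare the contents) can be sketched roughly as follows. The function name `isDuplicate` and its parameters are hypothetical and only illustrate the idea; the digests are assumed to have been computed beforehand by whichever hash function is chosen.

```swift
import Foundation

/// Hypothetical helper: decides whether `candidate` duplicates `existing`,
/// given digests that were already computed for both.
func isDuplicate(_ candidate: Data, candidateDigest: Data,
                 of existing: Data, existingDigest: Data) -> Bool {
    // Different digests: the contents are certainly different.
    guard candidateDigest == existingDigest else { return false }
    // Same digest: confirm with a full comparison to rule out a collision.
    return candidate == existing
}
```

The full comparison only runs when the digests match, so the extra cost is paid only for likely duplicates.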

4 Answers

14

Create a cryptographic hash of each file and use it for uniqueness comparisons. SHA-256 is a current hash function, and with Common Crypto on iOS it is quite fast: an iPhone 6S processes about 1 GB/second with SHA-256, not counting I/O time. If you need fewer bytes, just truncate the hash.
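
As a rough illustration of the truncation mentioned above, the sketch below keeps only the first few bytes of a digest and hex-encodes them. The name `hexString(from:truncatedToBytes:)` is hypothetical, and the digest is assumed to be a `Data` value already produced by a hash function.

```swift
import Foundation

/// Hypothetical helper: hex-encodes a digest, optionally keeping only
/// its first `byteCount` bytes.
func hexString(from digest: Data, truncatedToBytes byteCount: Int? = nil) -> String {
    // Clamp at zero so a negative count cannot trap in `prefix`.
    let kept = digest.prefix(max(0, byteCount ?? digest.count))
    return kept.map { String(format: "%02hhx", $0) }.joined()
}
```

Keep in mind that a shorter hash makes collisions more likely, so how many bytes to keep is a judgment call.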

An example using Common Crypto (Swift 3)

For hashing a string:

func sha256(string: String) -> Data {
    let messageData = Data(string.utf8) // Conversion to UTF-8 cannot fail
    var digestData = Data(count: Int(CC_SHA256_DIGEST_LENGTH))

    _ = digestData.withUnsafeMutableBytes {digestBytes in
        messageData.withUnsafeBytes {messageBytes in
            CC_SHA256(messageBytes, CC_LONG(messageData.count), digestBytes)
        }
    }
    return digestData
}
let testString = "testString"
let testHash = sha256(string:testString)
print("testHash: \(testHash.map { String(format: "%02hhx", $0) }.joined())")

let testHashBase64 = testHash.base64EncodedString()
print("testHashBase64: \(testHashBase64)")

Output:
testHash: 4acf0b39d9c4766709a3689f553ac01ab550545ffa4544dfc0b2cea82fba02a3
testHashBase64: Ss8LOdnEdmcJo2ifVTrAGrVQVF/6RUTfwLLOqC+6AqM=

Note: Add to your Bridging Header:

#import <CommonCrypto/CommonCrypto.h>

For hashing data:

func sha256(data: Data) -> Data {
    var digestData = Data(count: Int(CC_SHA256_DIGEST_LENGTH))

    _ = digestData.withUnsafeMutableBytes {digestBytes in
        data.withUnsafeBytes {messageBytes in
            CC_SHA256(messageBytes, CC_LONG(data.count), digestBytes)
        }
    }
    return digestData
}

let testData = Data("testString".utf8)
print("testData: \(testData.map { String(format: "%02hhx", $0) }.joined())")
let testHash = sha256(data:testData)
print("testHash: \(testHash.map { String(format: "%02hhx", $0) }.joined())")

Output:
testData: 74657374537472696e67
testHash: 4acf0b39d9c4766709a3689f553ac01ab550545ffa4544dfc0b2cea82fba02a3

Also see Martin's link.

zaph
  • Thank you for the lightning-fast reply! My reading of your code is that the function hashes a string; how would I go about hashing the plist (I assume over a Data object)? Also, is your call to base64EncodedString just an example of truncation? (Sorry, I have no idea about cryptography...) – iOS Gamer Mar 21 '17 at 17:38
  • The conversion to UTF-8 *cannot* fail – did we have that discussion before? – Martin R Mar 21 '17 at 17:38
  • @Martin, ah yes, I was copying old code of mine, correcting. Doubly embarrassing since I have corrected another person on this. – zaph Mar 21 '17 at 17:55
  • Thank you @zaph ! I used a mix of Martin's code and your code for obtaining the string representation. Thanks for the great help! – iOS Gamer Mar 21 '17 at 19:53
  • Are you sure this will work for large files? It requires the whole data to be in memory, right? @zaph – christopher.online Mar 20 '18 at 15:39
  • To handle large files you will need to hash in sections, probably with a file `Stream` class. You will have to use `CC_SHA256_Init`, followed by multiple calls to `CC_SHA256_Update`, and end with `CC_SHA256_Final`. – zaph Mar 20 '18 at 16:00
  • @KaraBenNemsi Create a question about large data and streaming and I will provide an answer with code. – zaph Mar 21 '18 at 01:47
  • @zaph https://stackoverflow.com/questions/49516176/sha256-hash-for-large-files-data-on-ios-in-swift – christopher.online Mar 27 '18 at 14:53
10

Solution which also works on large files because it does not require the whole file to be in memory:

func sha256(url: URL) -> Data? {
    do {
        let bufferSize = 1024 * 1024
        // Open file for reading:
        let file = try FileHandle(forReadingFrom: url)
        defer {
            file.closeFile()
        }

        // Create and initialize SHA256 context:
        var context = CC_SHA256_CTX()
        CC_SHA256_Init(&context)

        // Read up to `bufferSize` bytes, until EOF is reached, and update SHA256 context:
        while autoreleasepool(invoking: {
            // Read up to `bufferSize` bytes
            let data = file.readData(ofLength: bufferSize)
            if data.count > 0 {
                data.withUnsafeBytes {
                    _ = CC_SHA256_Update(&context, $0, numericCast(data.count))
                }
                // Continue
                return true
            } else {
                // End of file
                return false
            }
        }) { }

        // Compute the SHA256 digest:
        var digest = Data(count: Int(CC_SHA256_DIGEST_LENGTH))
        digest.withUnsafeMutableBytes {
            _ = CC_SHA256_Final($0, &context)
        }

        return digest
    } catch {
        print(error)
        return nil
    }
}

Usage, assuming a previously created `URL` instance named `fileURL`:

if let digestData = sha256(url: fileURL) {
    let calculatedHash = digestData.map { String(format: "%02hhx", $0) }.joined()
    DDLogDebug(calculatedHash)
}
christopher.online
3

As of Swift 5, @chriswillow's answer is still basically correct, but there were some updates to withUnsafeBytes/withUnsafeMutableBytes. These updates make the methods more type-safe, but also moderately more annoying to use.

For the bit using withUnsafeBytes, use:

_ = data.withUnsafeBytes { bytesFromBuffer -> Int32 in
  guard let rawBytes = bytesFromBuffer.bindMemory(to: UInt8.self).baseAddress else {
    return Int32(kCCMemoryFailure)
  }

  return CC_SHA256_Update(&context, rawBytes, numericCast(data.count))
}

For the bit generating the final digest data, use:

var digestData = Data(count: Int(CC_SHA256_DIGEST_LENGTH))
_ = digestData.withUnsafeMutableBytes { bytesFromDigest -> Int32 in
  guard let rawBytes = bytesFromDigest.bindMemory(to: UInt8.self).baseAddress else {
    return Int32(kCCMemoryFailure)
  }

  return CC_SHA256_Final(rawBytes, &context)
}
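
Putting the two pieces together, a complete Swift 5 version of the streaming function might look roughly like the sketch below. It assumes a toolchain where `import CommonCrypto` works directly from Swift without a bridging header (Xcode 10 or later, as far as I know). The name `sha256Streaming` and the `bufferSize` parameter are illustrative choices, and passing the raw buffer's `baseAddress` straight to `CC_SHA256_Update` is a variation on the `bindMemory` approach shown above, not a replacement for it.

```swift
import Foundation
import CommonCrypto

/// Sketch: SHA-256 of a file, read in chunks so the whole file
/// never has to be in memory at once.
func sha256Streaming(url: URL, bufferSize: Int = 1024 * 1024) throws -> Data {
    let file = try FileHandle(forReadingFrom: url)
    defer { file.closeFile() }

    var context = CC_SHA256_CTX()
    _ = CC_SHA256_Init(&context)

    while true {
        // Drain temporary objects created by each read.
        let readMore: Bool = autoreleasepool {
            let chunk = file.readData(ofLength: bufferSize)
            guard !chunk.isEmpty else { return false } // End of file
            chunk.withUnsafeBytes { buffer in
                // `baseAddress` is non-nil here because the chunk is not empty.
                _ = CC_SHA256_Update(&context, buffer.baseAddress, CC_LONG(buffer.count))
            }
            return true
        }
        if !readMore { break }
    }

    // Write the final digest into a plain byte array, then wrap it in Data.
    var digest = [UInt8](repeating: 0, count: Int(CC_SHA256_DIGEST_LENGTH))
    _ = CC_SHA256_Final(&digest, &context)
    return Data(digest)
}
```

Unlike the original, this sketch throws instead of returning an optional, so the caller decides how to report a file that cannot be opened.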
DesignatedNerd
3

An update using Apple's CryptoKit: You could use a FileHandle to read the data in chunks, and pass these into the hasher:

import CryptoKit

func getSHA256(forFile url: URL) throws -> SHA256.Digest {
    let handle = try FileHandle(forReadingFrom: url)
    var hasher = SHA256()
    while autoreleasepool(invoking: {
        let nextChunk = handle.readData(ofLength: SHA256.blockByteCount)
        guard !nextChunk.isEmpty else { return false }
        hasher.update(data: nextChunk)
        return true
    }) { }
    let digest = hasher.finalize()
    // To get the string form instead, convert the digest like this:
    // digest.map { String(format: "%02hhx", $0) }.joined()
    return digest
}
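
For small files such as the plists in the question, which fit comfortably in memory, CryptoKit's one-shot `SHA256.hash(data:)` may be enough. The sketch below turns the digest into a file name. The function name `contentBasedFileName` and the default `"plist"` extension are assumptions made for illustration, and CryptoKit requires iOS 13 or later.

```swift
import Foundation
import CryptoKit

/// Hypothetical helper: derives a file name from the SHA-256 of the contents.
func contentBasedFileName(for contents: Data, fileExtension: String = "plist") -> String {
    // One-shot hash of data that is already in memory.
    let digest = SHA256.hash(data: contents)
    // Hex-encode the 32 digest bytes.
    let hex = digest.map { String(format: "%02hhx", $0) }.joined()
    return "\(hex).\(fileExtension)"
}
```

Any field that varies between otherwise identical files, such as a creation date, would have to be left out of `contents` before hashing for two such files to get the same name.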
Apptek Studios