26

Previously I asked a question about combining SHA1+MD5, but I have since learned that calculating SHA1 and then MD5 of a large file is not much faster than SHA256. In my case, a 4.6 GB file takes about 10 minutes with the default SHA256 implementation (C#, Mono) on a Linux system.

public static string GetChecksum(string file)
{
    using (FileStream stream = File.OpenRead(file))
    {
        var sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(stream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}

Then I read this topic and changed my code according to what it said:

public static string GetChecksumBuffered(Stream stream)
{
    using (var bufferedStream = new BufferedStream(stream, 1024 * 32))
    {
        var sha = new SHA256Managed();
        byte[] checksum = sha.ComputeHash(bufferedStream);
        return BitConverter.ToString(checksum).Replace("-", String.Empty);
    }
}

But it doesn't have much effect and still takes about 9 minutes.

Then I tested the same file with the Linux sha256sum command. It takes about 28 seconds, and both the code above and the Linux command give the same result!

Someone advised me to read about the differences between a hash code and a checksum, and I found this topic, which explains the differences.

My questions are:

  1. What causes such a difference in time between the code above and Linux sha256sum?

  2. What does the code above do? (I mean, is it a hash code calculation or a checksum calculation? If you search for how to get the hash code of a file or the checksum of a file in C#, both lead to the code above.)

  3. Is there any feasible attack against sha256sum even though SHA256 is collision resistant?

  4. How can I make my implementation as fast as sha256sum in C#?

Mohammad Sina Karvandi
  • Is there a reason you can't call `sha256sum` from your code using a `Process`? – Nate Diamond Jul 21 '16 at 20:21
  • @NateDiamond Yeah! First, this program must run on Windows as well as Linux. Second, as I mentioned in my question, I don't know whether a checksum (or hash code) is secure enough. – Mohammad Sina Karvandi Jul 22 '16 at 17:23
  • This should be completely throttled by the cost of reading the file off the disk. 9 minutes is not unthinkable but you'd need a cheap laptop with a crappy spindle drive and not enough RAM. Document what you use. – Hans Passant Jul 24 '16 at 12:44
  • @HansPassant Actually, what I want to know is: is there any difference between the hash code of a file and the checksum of a file? – Mohammad Sina Karvandi Jul 25 '16 at 09:06

6 Answers

25
public string SHA256CheckSum(string filePath)
{
    using (SHA256 SHA256 = SHA256Managed.Create())
    {
        using (FileStream fileStream = File.OpenRead(filePath))
            return Convert.ToBase64String(SHA256.ComputeHash(fileStream));
    }
}
Mariot
  • The `Convert.ToBase64String` is wrong. You should use `BitConverter.ToString(SHA256.ComputeHash(fileStream)).Replace("-", "").ToLowerInvariant();` otherwise the hash will be wrong. – Daniel Habenicht Mar 23 '21 at 19:20
  • @Mariot Please update your answer to use `BitConverter.ToString()` instead of `Convert.ToBase64String()`. I don't want to downvote your answer since it is mostly correct. – JamesQMurphy Dec 22 '21 at 14:59
  • @DanielHabenicht Why is "Convert.ToBase64String()" wrong? – Sachin Joseph Jun 20 '22 at 22:49
  • @SachinJoseph Hashes / CheckSums are usually represented as hexadecimal strings which you'll get using a `BitConverter.ToString()`. `Convert.ToBase64String()` represents the byte array as Base64 string. While not wrong per se, when you don't use hex representation you won't be able to compare your hashes with results of other tools (i.e. download checksums). – Wolfgang Machert Jul 16 '22 at 09:54
  • Is there any way to monitor the progress of this function *as* it's running? (i.e. keep track of how many bytes have been processed, for example?) – NetXpert Oct 30 '22 at 01:11
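
To illustrate the encoding point raised in the comments above: the digest bytes are the same either way, and only the text representation differs. The following is a minimal sketch, not part of the original answer; the class and method names (`DigestEncodingDemo`, `Sha256Of`, `ToHex`) are made up for illustration.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class DigestEncodingDemo
{
    // Hashes the ASCII bytes of a string (illustrative helper, not from the thread).
    public static byte[] Sha256Of(string text)
    {
        using (SHA256 sha = SHA256.Create())
        {
            return sha.ComputeHash(Encoding.ASCII.GetBytes(text));
        }
    }

    // Lowercase hex without separators, the format sha256sum prints.
    public static string ToHex(byte[] digest)
    {
        return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
    }
}
```

For the input "abc", `ToHex(Sha256Of("abc"))` gives `ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad`, which is what `sha256sum` reports for the same three bytes. `Convert.ToBase64String` on the same 32-byte digest yields a 44-character string that encodes identical bytes but cannot be compared textually with hex output from other tools.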
13
  1. My best guess is that there's some additional buffering in the Mono implementation of the File.Read operation. Having recently looked into checksums on a large file, I'd say that on a decent-spec Windows machine you should expect roughly 6 seconds per GB if all is running smoothly.

    Oddly, more than one benchmark has reported that SHA-512 is noticeably quicker than SHA-256 (see 3 below). One other possibility is that the problem is not in allocating the data, but in disposing of the bytes once read. You may be able to use TransformBlock (and TransformFinalBlock) on a single array rather than reading the stream in one big gulp. I have no idea if this will work, but it bears investigating.

  2. The difference between hashcode and checksum is (nearly) semantics. They both calculate a shorter 'magic' number that is fairly unique to the data in the input, though if you have 4.6 GB of input and 64 bytes of output, 'fairly' is somewhat limited.

    • A checksum is not secure: with a bit of work you can work backwards from an output to an input that produces it, and do all sorts of insecure things.
    • A cryptographic hash takes longer to calculate, but changing just one bit in the input will radically change the output, and for a good hash (e.g. SHA-512) there's no known way of getting from output back to input.
  3. MD5 is broken: two different inputs with the same MD5 output (a collision) can be fabricated on a PC, so it should not be used where deliberate tampering matters. SHA-256 is (probably) still secure, but if your project has a lifespan measured in decades, assume you'll need to change it at some point. SHA-512 has no known attacks and probably won't for quite a while, and since it's quicker than SHA-256 on 64-bit hardware I'd recommend it anyway. Benchmarks show it takes about 3 times longer to calculate SHA-512 than MD5, so if your speed issue can be dealt with, it's the way to go.

  4. No idea, beyond those mentioned above. You're doing it right.
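
The TransformBlock idea from point 1 can be sketched roughly as follows. This is an untested-for-speed illustration rather than a known fix for the Mono slowdown: the class name `IncrementalSha256`, the 1 MiB default buffer, and the optional progress callback are all assumptions introduced here.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class IncrementalSha256
{
    // Hashes a stream chunk by chunk instead of calling ComputeHash(Stream).
    // bufferSize and onProgress are illustrative choices, not recommendations.
    public static string ComputeHex(Stream input, int bufferSize = 1024 * 1024,
                                    Action<long> onProgress = null)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] buffer = new byte[bufferSize];
            long total = 0;
            int read;
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Feed only the bytes actually read into the running hash state.
                sha.TransformBlock(buffer, 0, read, null, 0);
                total += read;
                if (onProgress != null) onProgress(total);
            }
            // An empty final block finishes the computation; the digest is in sha.Hash.
            sha.TransformFinalBlock(buffer, 0, 0);
            return BitConverter.ToString(sha.Hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

Because the loop controls the read size itself, the buffer can be tuned and measured, and the callback gives a running byte count for progress reporting. Whether this closes the gap to sha256sum depends on where the time actually goes, so it is worth timing before relying on it.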

For a bit of light reading, see Crypto.SE: SHA512 is faster than SHA256?

Edit in response to question in comment

The purpose of a checksum is to allow you to check if a file has changed between the time you originally wrote it, and the time you come to use it. It does this by producing a small value (512 bits in the case of SHA512) where every bit of the original file contributes at least something to the output value. The purpose of a hashcode is the same, with the addition that it is really, really difficult for anyone else to get the same output value by making carefully managed changes to the file.

The premise is that if the checksums are the same at the start and when you check it, then the files are the same, and if they're different the file has certainly changed. What you are doing above is feeding the file, in its entirety, through an algorithm that rolls, folds and spindles the bits it reads to produce the small value.

As an example: in the application I'm currently writing, I need to know if parts of a file of any size have changed. I split the file into 16K blocks, take the SHA-512 hash of each block, and store it in a separate database on another drive. When I come to see if the file has changed, I reproduce the hash for each block and compare it to the original. Since I'm using SHA-512, the chances of a changed file having the same hash are unimaginably small, so I can be confident of detecting changes in 100s of GB of data whilst only storing a few MB of hashes in my database. I'm copying the file at the same time as taking the hash, and the process is entirely disk-bound; it takes about 5 minutes to transfer a file to a USB drive, of which 10 seconds is probably related to hashing.
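
A minimal sketch of the per-block scheme described above, assuming 16 KiB blocks and SHA-512 as in the text. The class and method names (`BlockHasher`, `HashBlocks`, `ChangedBlocks`) are hypothetical, and storing the hashes in a database is left out.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

public static class BlockHasher
{
    public const int BlockSize = 16 * 1024; // 16K blocks, as in the description

    // Returns one SHA-512 digest per block; the last block may be shorter.
    public static List<byte[]> HashBlocks(Stream input)
    {
        var hashes = new List<byte[]>();
        byte[] block = new byte[BlockSize];
        using (SHA512 sha = SHA512.Create())
        {
            int filled;
            while ((filled = FillBlock(input, block)) > 0)
            {
                // ComputeHash resets its state after each call, so reuse is safe.
                hashes.Add(sha.ComputeHash(block, 0, filled));
            }
        }
        return hashes;
    }

    // Stream.Read may return fewer bytes than requested, so loop until full or EOF.
    private static int FillBlock(Stream input, byte[] block)
    {
        int filled = 0;
        while (filled < block.Length)
        {
            int read = input.Read(block, filled, block.Length - filled);
            if (read == 0) break;
            filled += read;
        }
        return filled;
    }

    // Indexes of blocks whose digest differs, or that exist on only one side.
    public static List<int> ChangedBlocks(IList<byte[]> before, IList<byte[]> after)
    {
        var changed = new List<int>();
        int max = Math.Max(before.Count, after.Count);
        for (int i = 0; i < max; i++)
        {
            bool same = i < before.Count && i < after.Count
                        && before[i].SequenceEqual(after[i]);
            if (!same) changed.Add(i);
        }
        return changed;
    }
}
```

Each digest is 64 bytes per 16 KiB block, which is why the stored hashes stay small relative to the data they cover.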

Lack of disk space to store hashes is a problem I can't solve in a post—buy a USB stick?

Michael
Richard Petheram
  • Well, it's amazing that SHA256 is slower than SHA512! I have another question: what is the code that I mentioned above doing? It might be ridiculous, but I can't find any difference when I search for 'Get file Checksum' and 'Get file hash'. They both give the same result. It seems that people don't know exactly what they are doing :). (Like me!) – Mohammad Sina Karvandi Jul 25 '16 at 20:45
  • And another thing is that I can't store 128 bytes for SHA512! There are tons of files and I don't have enough space. – Mohammad Sina Karvandi Jul 25 '16 at 21:00
4

Way late to the party but seeing as none of the answers mentioned it, I wanted to point out:

SHA256Managed is an implementation of the System.Security.Cryptography.HashAlgorithm class, and all of the functionality related to read operations is handled in the inherited code.

HashAlgorithm.ComputeHash(Stream) uses a fixed 4096 byte buffer to read data from a stream. As a result, you're not really going to see much difference using a BufferedStream for this call.

HashAlgorithm.ComputeHash(byte[]) operates on the entire byte array, but it resets the internal state after every call, so it can't be used to incrementally hash a buffered stream.

Your best bet would be to use a third party implementation that's optimized for your use case.
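
If a third-party library is not an option, one way to sidestep the fixed read size described above is to do the reads yourself and feed them to `IncrementalHash`. This is a sketch under assumptions: it requires a runtime that ships `IncrementalHash` (recent .NET and .NET Framework versions do; availability on a given Mono release should be checked), and the class name `LargeReadHasher` and the 4 MiB chunk size are arbitrary choices made here.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class LargeReadHasher
{
    // Reads the file in caller-chosen chunks and hashes them incrementally.
    public static string Sha256Hex(string path, int chunkSize = 4 * 1024 * 1024)
    {
        // SequentialScan hints to the OS that the file is read front to back.
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, 4096, FileOptions.SequentialScan))
        using (IncrementalHash hasher = IncrementalHash.CreateHash(HashAlgorithmName.SHA256))
        {
            byte[] chunk = new byte[chunkSize];
            int read;
            while ((read = stream.Read(chunk, 0, chunk.Length)) > 0)
            {
                // Append only the bytes returned by this read.
                hasher.AppendData(chunk, 0, read);
            }
            byte[] digest = hasher.GetHashAndReset();
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

Unlike `ComputeHash(byte[])`, `AppendData` keeps the internal state between calls, so the file never has to be held in memory in full. Any speed-up is not guaranteed and should be measured against the original code on the target machine.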

Charles Grunwald
2
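using System.IO;
using System.Security.Cryptography;
using System.Text;

// Method wrapper added so the snippet compiles; the name is arbitrary.
public static string Sha256Hex(string filePath)
{
    using (SHA256 sha256 = SHA256.Create())
    using (FileStream fileStream = File.OpenRead(filePath))
    {
        var result = new StringBuilder();
        foreach (byte b in sha256.ComputeHash(fileStream))
        {
            // "x2" formats each byte as two lowercase hex digits.
            result.Append(b.ToString("x2"));
        }

        return result.ToString();
    }
}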

For Reference: https://www.c-sharpcorner.com/article/how-to-convert-a-byte-array-to-a-string/

Tushar
0
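using System;
using System.IO;
using System.Security.Cryptography;

// Open the file for reading; File.Create would truncate it to zero bytes.
using (var fileStream = File.OpenRead(filePath))
using (var sha = SHA256.Create())
{
    // Base64 text of the digest; use BitConverter.ToString for hex output.
    var hash = Convert.ToBase64String(sha.ComputeHash(fileStream));
}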
  • Remember that Stack Overflow isn't just intended to solve the immediate problem, but also to help future readers find solutions to similar problems, which requires understanding the underlying code. This is especially important for members of our community who are beginners, and not familiar with the syntax. Given that, **can you [edit] your answer to include an explanation of what you're doing** and why you believe it is the best approach? – Jeremy Caney Mar 14 '23 at 01:14
0

Try this; it worked for me, and I double-checked the hashes with PowerShell and another Python script too. (Apologies in advance for the weird indentation.)

using System;
using System.IO;
using System.Security.Cryptography;

/* Returns the SHA-256 hash of a given executable file as lowercase hex,
   or an empty string if hashing fails. */
public static string GetExecutableHash(string fullPathToFile)
{
    string hash = string.Empty;

    // Open read-only so that read-only or in-use executables can still be hashed.
    using (FileStream fileStream = File.OpenRead(fullPathToFile))
    using (SHA256 sha256 = SHA256.Create())
    {
        try
        {
            byte[] hashValue = sha256.ComputeHash(fileStream);
            hash = BitConverter.ToString(hashValue).Replace("-", String.Empty).ToLowerInvariant();
        }
        catch (IOException e)
        {
            Console.WriteLine($"I/O Exception: {e.Message}");
        }
        catch (UnauthorizedAccessException e)
        {
            Console.WriteLine($"Access Exception: {e.Message}");
        }
    }
    return hash;
}
MJP
  • Thank you for contributing to the Stack Overflow community. This may be a correct answer, but it’d be really useful to provide additional explanation of your code so developers can understand your reasoning. This is especially useful for new developers who aren’t as familiar with the syntax or struggling to understand the concepts. **Would you kindly [edit] your answer to include additional details for the benefit of the community?** – Jeremy Caney Jul 01 '23 at 17:56