
I know it is always better to use checksum algorithms that people have already invented. I want to be able to tell whether two files are the same by comparing checksums. The files live on two different computers, and because they are across a network it is faster to compute a checksum than to copy the entire file, especially with large files like mine. (I will perform other tests first, such as making sure the files have the same length, etc.)

So I have created this simple algorithm:

private static double GetChecksum2(string file)
{
    double checkSum = 0;

    // The bigger the chunk size, the faster the loop runs, but the more memory it uses.
    // Also, when sending the file over the network, a bigger chunk should not make it much more efficient.
    // 2^10 = 1 kilobyte, 2^20 = 1 megabyte, 2^30 = 1 gigabyte, etc.
    int chunkSize = (int)Math.Pow(2, 20);
    byte[] buffer = new byte[chunkSize];

    // Dispose of the stream when done reading.
    using (var stream = File.OpenRead(file))
    {
        int bytesRead;

        // Stream.Read returns the number of bytes read, or 0 at the end of the file.
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Only the first bytesRead bytes of the buffer are valid on this pass.
            // Here the chunk could also be written to a file, etc.

            // Temporary checksum; use a better algorithm. I don't know whether
            // other computers will round doubles differently.
            for (int i = 0; i < bytesRead; i++)
            {
                checkSum = (((buffer[i] + i)/2 + checkSum))*.45;
            }

            //SHA256Managed sha = new SHA256Managed();
            //byte[] checksum = sha.ComputeHash(buffer);
        }
    }

    return checkSum;
}

I don't know what the odds are that two different files would produce the same checksum with this algorithm.

When computing the checksum of a 1.06 GB file, it takes 5.2 seconds to complete and the checksum comes out to 321840.207306214.

When I use the SHA256Managed() algorithm instead, it takes 35.8 seconds.

That is about 7 times longer.
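
As a side note on the commented-out SHA256Managed lines above: ComputeHash(buffer) would hash each chunk separately rather than the whole file. A minimal sketch of hashing an entire file in one pass over the stream (the helper name GetSha256 is illustrative, and it assumes using System.IO and System.Security.Cryptography):

private static byte[] GetSha256(string file)
{
    // ComputeHash(Stream) reads the file in small internal blocks,
    // so the whole file never has to be loaded into memory at once.
    using (var sha = SHA256.Create())
    using (var stream = File.OpenRead(file))
    {
        return sha.ComputeHash(stream);
    }
}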

I know that the odds of two different files having the same checksum are much lower with SHA-256 than with my algorithm. But my algorithm is much faster, and I think the odds should still be pretty low...

Or perhaps I should use an even faster algorithm that already exists and that I don't know about...

edit

My question is:

Will it be safe to implement this algorithm? I need to do a lot of file transfers over my network, and it would be nice if I could use a checksum algorithm to compare files. Maybe I could split each file into chunks and just replace the chunks whose checksums don't match!
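
A rough sketch of that chunk idea (this uses SHA-256 per chunk rather than the formula above; the method name and chunk size are just for illustration, and it assumes using System.Collections.Generic, System.IO and System.Security.Cryptography):

private static List<byte[]> GetChunkHashes(string file)
{
    // Same 1 MiB chunk size as in GetChecksum2 above.
    const int chunkSize = 1 << 20;
    var buffer = new byte[chunkSize];
    var hashes = new List<byte[]>();

    using (var sha = SHA256.Create())
    using (var stream = File.OpenRead(file))
    {
        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Hash only the bytes actually read on this pass.
            hashes.Add(sha.ComputeHash(buffer, 0, bytesRead));
        }
    }

    return hashes;
}

Running this on both machines and comparing the two lists index by index would show exactly which chunks need to be re-sent; only the 32-byte hashes have to cross the network.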

Tono Nam
  • Check out CRC32 implementations, they're pretty fast. Also, you shouldn't use double, as you're almost definitely going to run into issues with doubles not being exact. – Matthew Dec 05 '11 at 19:37
  • How about a classic CRC32 or CRC64? – CodesInChaos Dec 05 '11 at 19:37
  • is this algorithm reliable to use when comparing files that have the same length... – Tono Nam Dec 05 '11 at 19:38
  • I'm pretty surprised that you're not IO bound. And why SHA256 *managed*? I think the unmanaged implementation is significantly faster. MD5 might be even faster (not cryptographically secure, but you don't seem to care). – CodesInChaos Dec 05 '11 at 19:38
  • If two **1GB** files differ in their first few bytes, computing the checksum of the whole file will be considerably slower. – parapura rajkumar Dec 05 '11 at 19:38
  • not if the files are over a network – Tono Nam Dec 05 '11 at 19:45
  • @Tono If two 1GB files differ in their first few bytes, your algorithm won't even notice. It's easy to be fast if you don't need to be correct :P – CodesInChaos Dec 05 '11 at 19:45
  • @TonoNam - Your algorithm is faster because its not actually finding the checksum. – Security Hound Dec 05 '11 at 19:48
  • The checksum is included in the formula, so I believe it returns a checksum. I also tried it with several files. It might work differently on a different operating system, but I think a double should be the same size on multiple operating systems. The CRC32 algorithm takes 1 second longer than mine, so why reinvent the wheel? I did not know that hashing algorithms were that fast. I tried MD5 and SHA and they were slow compared to mine; that's why I wanted to implement one... – Tono Nam Dec 05 '11 at 21:35

2 Answers


Floating-point math is not guaranteed to be deterministic. You might get slightly different results on different computers or versions of .NET. In your algorithm that can be avoided with an epsilon comparison, but in many algorithms it can't be avoided at all.

Another issue with your algorithm is that the contribution of early bytes becomes exponentially small, i.e. only the last part of the file influences the hash. A quick estimate is that only the last few kB are taken into account. This means your hash is not fit for its purpose.

If we neglect rounding errors we can simplify your formula:

(((buffer[i] + i)/2 + checkSum))*.45

buffer[i]*0.45/2 + i*0.45/2 + checkSum*0.45

Solving the recursion gives us:

Sum(buffer[i]/2*(0.45^(length-i)) + i/2*(0.45^(length-i)))

The second term depends only on the length, so when comparing files of equal length you're left with:

Sum(buffer[i]/2*(0.45^(length-i)))
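
As a rough illustration of how quickly those weights die off (a sketch based on the simplified formula above): the weight 0.45^k assigned to a byte k positions before the end of the file underflows to zero in a double once k reaches roughly 1000, so earlier bytes contribute nothing at all.

using System;

class WeightDecayDemo
{
    static void Main()
    {
        // Weight of a byte k positions before the end of the file,
        // per the simplified formula above: proportional to 0.45^k.
        foreach (int k in new[] { 10, 100, 1000 })
        {
            // Prints roughly 3.4E-04, 2.1E-35 and 0 (underflow).
            Console.WriteLine("0.45^{0} = {1}", k, Math.Pow(0.45, k));
        }
    }
}
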
CodesInChaos

Using a double for a checksum is prone to floating-point problems, and I think that's a really bad idea. Re-inventing the wheel is also a poor decision: there are many checksum algorithms already available for you to reuse.
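
For example, since the comments mention CRC32 and the framework itself does not ship a CRC32 class, here is a minimal table-driven CRC-32 sketch (the standard reflected polynomial 0xEDB88320; the class name and the chunked file reading are my own choices):

using System.IO;

static class Crc32
{
    private static readonly uint[] Table = BuildTable();

    private static uint[] BuildTable()
    {
        // Standard reflected CRC-32 polynomial (as used by zip and PNG).
        const uint poly = 0xEDB88320u;
        var table = new uint[256];
        for (uint i = 0; i < 256; i++)
        {
            uint crc = i;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 1) != 0 ? (crc >> 1) ^ poly : crc >> 1;
            table[i] = crc;
        }
        return table;
    }

    // Streams the file through the CRC in 1 MiB chunks.
    public static uint OfFile(string path)
    {
        uint crc = 0xFFFFFFFFu;
        var buffer = new byte[1 << 20];
        using (var stream = File.OpenRead(path))
        {
            int bytesRead;
            while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < bytesRead; i++)
                    crc = (crc >> 8) ^ Table[(crc ^ buffer[i]) & 0xFF];
            }
        }
        return ~crc;
    }
}

Unlike SHA-256, CRC-32 only guards against accidental corruption, but for comparing files you control that is usually enough.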

Also, some related questions:

Cᴏʀʏ