
I'm uploading files to an FTP server and need to make sure that the files are transmitted correctly. To verify this, I re-download each file afterwards and check whether its content matches the original local file. To do this I read each file in small chunks and generate MD5 sums over the content.

Even though MD5 is limited in what it can represent, I think it is sufficient for detecting differences between the files (usually up to 2 MB in size). However, when I generated the MD5 for each stream (one being the download stream, the other being the local file's read stream), the two MD5s turned out different, even though the first few chunks were identical in terms of their MD5, and the file itself is a zip file that extracts without problems both locally and on the FTP server.

What I would like to know is: Is the idea itself flawed? Or is there an error in my code? Or why do the contents seemingly differ?

The calls:

    ftpMD5 = GeneriereMD5FuerStream(ftpAnsuchen.GetResponse().GetResponseStream());
    lokalMD5 = GeneriereMD5FuerStream((new FileInfo(lokaleDateiPfad)).OpenRead());

    if (ftpMD5.Equals(lokalMD5) == false)
    {
        throw new Exception("Different");
    }

The code for the method:

    private string GeneriereMD5FuerStream(Stream leseStream)
    {
        string md5String = String.Empty;
        byte[] leseBuffer = new byte[2048];
        int bytesGelesen = 0;
        MD5 md5Converter = MD5.Create();

        bytesGelesen = leseStream.Read(leseBuffer, 0, leseBuffer.Length);
        md5String = BitConverter.ToString(md5Converter.ComputeHash(Encoding.Default.GetBytes(md5String + BitConverter.ToString(md5Converter.ComputeHash(leseBuffer)))));

        while (bytesGelesen > 0)
        {
            bytesGelesen = leseStream.Read(leseBuffer, 0, leseBuffer.Length);

            if (bytesGelesen > 0) 
            {
                md5String = BitConverter.ToString(md5Converter.ComputeHash(Encoding.Default.GetBytes(md5String + BitConverter.ToString(md5Converter.ComputeHash(leseBuffer)))));
            }
        }

        return md5String;
    }
Thomas
  • In the meantime I have circumvented the problem by just using return BitConverter.ToString((MD5.Create()).ComputeHash(leseStream)); but I would still be interested in why the manual creation of the MD5 fails there. – Thomas Nov 02 '15 at 13:20
  • Yes, and the way you are computing the hash means that the hash will be different depending on the size of the chunks. – keith Nov 07 '15 at 12:51

3 Answers


I highly recommend downloading the entire file and performing a hash over the entire file contents.

As keith mentioned, you cannot guarantee that each read will fill your buffer, since network latency plays a role. The other approach would be to calculate MD5 hashes at fixed byte intervals rather than per buffer, but you still end up downloading the entire file in the end, so just do that from the beginning.

MD5.ComputeHash also has a stream overload you should be using.

https://msdn.microsoft.com/en-us/library/system.security.cryptography.md5(v=vs.110).aspx
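
For example, a minimal sketch of using that overload (the method name ComputeMD5 is mine, not from the question):

    using System;
    using System.IO;
    using System.Security.Cryptography;

    // Hashes the whole stream in one call; the framework does the chunking
    // internally, so varying read sizes cannot affect the result.
    private static string ComputeMD5(Stream stream)
    {
        using (MD5 md5 = MD5.Create())
        {
            return BitConverter.ToString(md5.ComputeHash(stream));
        }
    }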

AIDA

Is it necessary to calculate hashes in your case? I suppose there is another way to check whether the file was transmitted. You could use a BackgroundWorker and monitor the progress while uploading the file. Somebody described it HERE.

marcinax
  • When I used a tool to transfer a 100 MB zip file, for example, I got no error; the file was there but corrupted (i.e., not extractable). The hashes idea stems from this occurrence, so that I can check whether the file is correct (even though it COULD be corrupted during the re-download, in my case it is better to report an error one time too many than one time too few). So yes, I fear it is necessary to check that all bytes are the same on the FTP server and locally. The only question is whether this is the correct way to do it, and if so, why my code fails. – Thomas Oct 29 '15 at 08:27

The way you are computing the MD5 hash is sensitive to the number of bytes read into the buffer.

Looking at this line:

bytesGelesen = leseStream.Read(leseBuffer, 0, leseBuffer.Length);

For the file, it will be the case that bytesGelesen will always be leseBuffer.Length until the last block is read.

For a network stream, it's likely that bytesGelesen will not be the full size of leseBuffer.

You have two options: read the file from the network stream to disk and then use your current method on that file to compute the hash (which ensures consistency of the bytes read on each iteration of Read), or change your hash calculation so that it returns the same value regardless of the number of bytes read on each call to Read.
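
For the second option, a sketch of a chunk-size-independent variant, reusing the names from the question: TransformBlock/TransformFinalBlock feed only the bytes each Read actually returned into a single running hash, so the chunk boundaries no longer matter.

    using System;
    using System.IO;
    using System.Security.Cryptography;

    private string GeneriereMD5FuerStream(Stream leseStream)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] leseBuffer = new byte[2048];
            int bytesGelesen;

            while ((bytesGelesen = leseStream.Read(leseBuffer, 0, leseBuffer.Length)) > 0)
            {
                // Feed only the bytes this Read call actually returned.
                md5.TransformBlock(leseBuffer, 0, bytesGelesen, null, 0);
            }

            // Finalize with an empty block, then read the accumulated hash.
            md5.TransformFinalBlock(leseBuffer, 0, 0);
            return BitConverter.ToString(md5.Hash);
        }
    }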

To test this theory, simply write out bytesGelesen when pulling the file from the FTP server and compare it to the values when reading the file from disk.
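
A quick sketch of such a check (leseStream standing in for either stream):

    byte[] leseBuffer = new byte[2048];
    int bytesGelesen;

    // Log every chunk size; the FTP stream will likely show values
    // smaller than 2048 in places where the file stream does not.
    while ((bytesGelesen = leseStream.Read(leseBuffer, 0, leseBuffer.Length)) > 0)
    {
        Console.WriteLine(bytesGelesen);
    }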

keith
  • Ah, so you mean that despite the files being identical, the download from the FTP server could end up with differently sized chunks if the server is a bit too slow in between and thus sends less, then more again? – Thomas Nov 07 '15 at 08:24