0

I've built a file copying routine into a common library for a variety of different (WinForms) applications I'm currently working on. What I've built implements the commonly-used CopyFileEx method to actually perform the file copy while displaying the progress, which seems to be working great.

The only real issue I'm encountering is that, because most of the file copying I'm doing is for archival purposes, once the file is copied, I would like to "verify" the new copy of the file. I have the following methods in place to do the comparison/verification. I'm sure many of you will quickly see where the "problem" is:

Public Shared Function CompareFiles(ByVal File1 As IO.FileInfo, ByVal File2 As IO.FileInfo) As Boolean
    Dim Match As Boolean = False

    If File1.FullName = File2.FullName Then
        Match = True
    Else
        If File.Exists(File1.FullName) AndAlso File.Exists(File2.FullName) Then
            If File1.Length = File2.Length Then
                If File1.LastWriteTime = File2.LastWriteTime Then
                    Try
                        Dim File1Hash As String = HashFileForComparison(File1)
                        Dim File2Hash As String = HashFileForComparison(File2)

                        If File1Hash = File2Hash Then
                            Match = True
                        End If
                    Catch ex As Exception
                        Dim CompareError As New ErrorHandler(ex)

                        CompareError.LogException()
                    End Try
                End If
            End If
        End If
    End If

    Return Match
End Function

Private Shared Function HashFileForComparison(ByVal OriginalFile As IO.FileInfo) As String
    Using BufferedFileReader As New IO.BufferedStream(File.OpenRead(OriginalFile.FullName), 1200000)
        Using MD5 As New System.Security.Cryptography.MD5CryptoServiceProvider
            Dim FileHash As Byte() = MD5.ComputeHash(BufferedFileReader)

            Return System.Text.Encoding.Unicode.GetString(FileHash)
        End Using
    End Using
End Function

This CompareFiles() method checks a few of the "simple" elements first:

  • Is it trying to compare a file to itself? (if so, always return True)
  • Do both files actually exist?
  • Are the two files the same size?
  • Do they both have the same modification date?

But, you guessed it, here's where the performance takes the hit. Especially for large files, the MD5.ComputeHash method of the HashFileForComparison() method can take a while - about 1.25 minutes for a 500MB file for a total of about 2.5 minutes to compute both hashes for the comparison. Does anyone have a better suggestion for how to more efficiently verify the new copy of the file?

G_Hosa_Phat
  • 976
  • 2
  • 18
  • 38
  • 1
    Why not compare the files directly instead of first computing hashes and then comparing those? For only two files, you’re not saving any work. It also makes no sense to convert the MD5 hash to a string before comparing it. Apart from that you can make your code more readable by rewriting it to exit the function as soon as possible, rather than nesting your `If` statements. And lastly, writing `If ‹condition› then Variable = True` is an anti-pattern. Just write `Variable = ‹condition›`. – Konrad Rudolph Nov 13 '18 at 17:14
  • To be clear, is your suggestion to eliminate the hash comparison entirely? My main intention is to do my best to ensure that the "archive" copy is accurate in case I need to restore it later. Comparing the file size and modification date *should* generally indicate that the files are "the same", but I want to leave out as much room for error as possible. – G_Hosa_Phat Nov 13 '18 at 17:20
  • 2
    One of my suggestions is to remove the hash comparison, yes. It’s a detour. Just open both files and compare their contents in chunks. That way you avoid the (somewhat costly) hash computation. – Konrad Rudolph Nov 13 '18 at 17:29
  • I've tried using a couple of the methods suggested in the thread below but it seems that my current MD5 implementation is actually outperforming them. Even when I "tweak" the buffer sizes and such, the hash method I'm using is still a few seconds (or more) faster with this 500MB file. https://stackoverflow.com/questions/1358510/how-to-compare-2-files-fast-using-net – G_Hosa_Phat Nov 13 '18 at 20:45
  • I may be missing something here, but your function `HashFileForComparison` has the code `Using BufferedFileReader ..`, but you don't use `BufferedFileReader` you refer to `FileReader` in your sample - is this a typo in the pasted sample or a mistake in your code? – David Wilson Nov 15 '18 at 11:29
  • @DavidWilson - Yes, that's a typo. At one point I was using a plain `Stream` object (named `FileReader`). I changed it to a `BufferedStream` object as I was posting and forgot to change that line in the code I posted. I've corrected the code for that above. – G_Hosa_Phat Nov 15 '18 at 14:28
  • Seems like it would be possible to copy a file, open the copy for write to change a byte, then change it back, and have to identical files with different write times. Unix even has the `touch` command to update write times without changing anything. – Joel Coehoorn Nov 15 '18 at 14:54
  • @JoelCoehoorn - That's pretty much the main reason I do the "simple" checks first to compare the `.Length` and `.LastWriteTime` properties of the two files before I go to the trouble of hashing them. The hashing only comes into play if the two files are already identical in these ways. If not, they aren't identical and the copy process "fails". – G_Hosa_Phat Nov 15 '18 at 14:59
  • If your function is comparing a file to itself, I'd throw an exception instead of returning True. Returning True will mask a programming logic error in this case, making one think two different files are being compared when they're not. – HardCode Nov 15 '18 at 15:03
  • @G_Hosa_Phat My point is the file might **fail** one of your simple checks, but still be identical. You should remove the LastWriteTime check. You should also remove the `File.Exists()` checks. Just assume the files exist, and handle the exception to return `false` if they fail. That will save you a couple extra round trips out to the hard disk. – Joel Coehoorn Nov 15 '18 at 15:07

0 Answers0