I'm currently writing an application that migrates millions of images (.TIF) from one NAS to another, and I want to add a validation step that checks whether the files were copied correctly.

The way I copy is with a function that does this:

Public Function CopyFiles(ByVal origin As String, ByVal copiedFile As String) As Boolean
    Try
        ' Copy only if the destination file does not exist yet
        If File.Exists(copiedFile) = False Then
            My.Computer.FileSystem.CopyFile(origin, copiedFile)
            Log("File copied successfully")
        Else
            Log("File already exists")
        End If
        Return True
    Catch ex As Exception
        Log("Error while copying file " + origin.ToString + " Error:" + ex.ToString)
    End Try
    Return False
End Function

I did have this file compare function:

Private Function FileCompare(ByVal file1 As String, ByVal file2 As String) As Boolean
    'Compares the files byte by byte to check that they are equal.
    'CURRENTLY NOT USED
    Dim file1byte As Integer
    Dim file2byte As Integer
    Dim fs1 As FileStream
    Dim fs2 As FileStream
    Try

        ' Determine if the same file was referenced two times.
        If (file1 = file2) Then
            ' Return True to indicate that the files are the same.
            Return True
        End If

        ' Open the two files.
        fs1 = New FileStream(file1, FileMode.Open)
        fs2 = New FileStream(file2, FileMode.Open)

        ' Check the file sizes. If they are not the same, the files
        ' are not equal.
        If (fs1.Length <> fs2.Length) Then
            ' Close the file
            fs1.Close()
            fs2.Close()

            ' Return False to indicate that the files are different.
            Return False
        End If

        ' Read and compare a byte from each file until either a
        ' non-matching set of bytes is found or until the end of
        ' file1 is reached.
        Do
            ' Read one byte from each file.
            file1byte = fs1.ReadByte()
            file2byte = fs2.ReadByte()
        Loop While ((file1byte = file2byte) And (file1byte <> -1))

        ' Close the files.
        fs1.Close()
        fs2.Close()

        ' Return the success of the comparison. "file1byte" is
        ' equal to "file2byte" at this point only if the files are 
        ' the same.
        If ((file1byte - file2byte) = 0) Then
            'Log("******* File compared successfully= " + file1.ToString + "  " + file2.ToString + " *******")
            Return True
        Else
            Log("******* ERROR: while comparing files: " + file1.ToString + "  " + file2.ToString + " *******")
            Return False
        End If


    Catch ex As Exception
        Log("******* ERROR, exception while comparing files: " + file1.ToString + " VS " + file2.ToString + " " + ex.ToString.ToUpper + " *******")
        Return False
    End Try
    Return True
End Function

But comparing every single image byte by byte took too long, so I have been thinking of other ways to validate that a file was copied correctly.

So far I only check that the copied file exists, but that doesn't guarantee that it was copied without any issues.

So my ideas are:

  • Create a function that opens and closes the file just to check if it can open.

  • Create a function that compares the size of the original file and the copied one, but I don't know if there could be any case where the copied file has the same size but with errors.

  • Just leave a function that verifies that the copied file exists, since so far in all my tests, I haven't got any problem with my copied images.

Badja
Throkar
  • Why do you want to reinvent the wheel? Aren't there already programs that do this for you? Take, for example, [robocopy](https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/robocopy) – Storax Mar 28 '19 at 14:39
  • @Storax Because the whole process involves updating and inserting into several databases as well, so I decided to create an app to do all of that. – Throkar Mar 28 '19 at 14:43
  • Ok, no problem, I just wanted to give you this hint/tip. – Storax Mar 28 '19 at 14:50
  • I really appreciate it – Throkar Mar 28 '19 at 14:52
  • If something takes a while on a single thread, it would be worth exploring Multi-Threading - do more things at once. – JayV Mar 28 '19 at 14:55
  • You don't need to compare byte for byte, you can compare arrays of bytes - that may be quicker. For other comparison methods, look into CRC, Hashes, and checksums – JayV Mar 28 '19 at 14:57
  • @JayV How could I do that? – Throkar Mar 28 '19 at 14:58
  • @Throkar For multi-threading start with https://learn.microsoft.com/en-us/dotnet/standard/threading/using-threads-and-threading. For Checksums [Checking the MD5 of file in VB.NET](https://stackoverflow.com/questions/7930302/checking-the-md5-of-file-in-vb-net) – JayV Mar 28 '19 at 15:01
  • `FileStream` has a [FileStream.Read Method](https://learn.microsoft.com/en-us/dotnet/api/system.io.filestream.read?view=netframework-4.7.2) that can read an array of bytes instead of just a single byte. – Olivier Jacot-Descombes Mar 28 '19 at 15:01
  • Jon Skeet has a great example using MD5 checksum [here](https://stackoverflow.com/questions/10520048/calculate-md5-checksum-for-a-file/10520086#10520086)... I would also *pre compute* your original and then check your new copy against the original; then you are only making one pass... – Trevor Mar 28 '19 at 15:35

1 Answer

The usual way to do this is to hash both files and compare the hashes. MD5 is a common hash function for this purpose, and in practice it is much faster than reading and comparing one byte at a time with ReadByte. Change your code to the following:

Private Function FileCompare(ByVal file1 As String, ByVal file2 As String) As Boolean
    ' Compares two files by size first, then by MD5 hash.
    Try

        ' Determine if the same file was referenced two times.
        If (file1 = file2) Then
            Return True
        End If

        ' Check the file sizes without opening the files. If they are not
        ' the same, the files are not equal and hashing can be skipped.
        If (New FileInfo(file1).Length <> New FileInfo(file2).Length) Then
            Return False
        End If

        ' Hash each file. hashFileMD5 opens and closes its own stream, so no
        ' other handle to either file may be held open at this point
        ' (a second open handle makes the FileStream constructor throw).
        Dim file1Hash As String = hashFileMD5(file1)
        Dim file2Hash As String = hashFileMD5(file2)

        Return file1Hash = file2Hash

    Catch ex As Exception
        Log("******* ERROR, exception while comparing files: " + file1.ToString + " VS " + file2.ToString + " " + ex.ToString.ToUpper + " *******")
        Return False
    End Try
End Function

Private Function hashFileMD5(ByVal filepath As String) As String
    Using reader As New System.IO.FileStream(filepath, IO.FileMode.Open, IO.FileAccess.Read)
        Using md5 As New System.Security.Cryptography.MD5CryptoServiceProvider
            Dim hashBytes() As Byte = md5.ComputeHash(reader)
            ' Hex-encode the digest. Decoding raw hash bytes as Unicode text
            ' is lossy, because invalid sequences are replaced on decoding.
            Return BitConverter.ToString(hashBytes)
        End Using
    End Using
End Function
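
As an alternative to hashing, the comments on the question suggest comparing arrays of bytes read with FileStream.Read. The following is only a minimal sketch of that idea, not part of the original answer: the module name, the function names FilesAreEqual and ReadBlock, and the 80 KB block size are all assumptions chosen for illustration.

```vbnet
Imports System.IO

Module BlockCompareSketch
    ' Hypothetical helper: compares two files block by block and stops
    ' at the first difference instead of hashing both files completely.
    Public Function FilesAreEqual(ByVal path1 As String, ByVal path2 As String) As Boolean
        Const BufferSize As Integer = 81920 ' arbitrary block size, an assumption

        Using fs1 As New FileStream(path1, FileMode.Open, FileAccess.Read, FileShare.Read),
              fs2 As New FileStream(path2, FileMode.Open, FileAccess.Read, FileShare.Read)

            ' Files of different length can never be equal.
            If fs1.Length <> fs2.Length Then Return False

            Dim buffer1(BufferSize - 1) As Byte
            Dim buffer2(BufferSize - 1) As Byte
            Dim read1 As Integer

            Do
                read1 = ReadBlock(fs1, buffer1)
                Dim read2 As Integer = ReadBlock(fs2, buffer2)
                If read1 <> read2 Then Return False

                ' Compare only the bytes that were actually read.
                For i As Integer = 0 To read1 - 1
                    If buffer1(i) <> buffer2(i) Then Return False
                Next
            Loop While read1 > 0
        End Using

        ' Both streams ended without a mismatch.
        Return True
    End Function

    ' Stream.Read may return fewer bytes than requested, so keep reading
    ' until the buffer is full or the end of the stream is reached.
    Private Function ReadBlock(ByVal source As Stream, ByVal buffer As Byte()) As Integer
        Dim total As Integer = 0
        While total < buffer.Length
            Dim n As Integer = source.Read(buffer, total, buffer.Length - total)
            If n = 0 Then Exit While
            total += n
        End While
        Return total
    End Function
End Module
```

Unlike the hash comparison, this version can return early on the first differing block, but it still has to read both files completely whenever they are identical.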

Additionally, since you are processing many files, I strongly recommend running the tasks in parallel. Use Parallel.ForEach if you are on .NET Framework 4 or later.
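
A rough sketch of what that could look like is shown here. The module name, the CopyAll procedure, the filePairs parameter and the maxWorkers cap are hypothetical names introduced for illustration, and the body uses File.Copy directly so that the example is self-contained; in the real application the call would go to your own copy and compare functions.

```vbnet
Imports System.Collections.Generic
Imports System.IO
Imports System.Threading.Tasks

Module ParallelCopySketch
    ' Hypothetical driver: each pair holds the source path (Key) and the
    ' destination path (Value). maxWorkers caps the number of concurrent copies.
    Public Sub CopyAll(ByVal filePairs As IEnumerable(Of KeyValuePair(Of String, String)),
                       ByVal maxWorkers As Integer)
        Dim options As New ParallelOptions()
        options.MaxDegreeOfParallelism = maxWorkers

        Parallel.ForEach(filePairs, options,
            Sub(pair)
                Try
                    ' Same rule as the original CopyFiles: skip existing files.
                    If Not File.Exists(pair.Value) Then
                        File.Copy(pair.Key, pair.Value)
                    End If
                Catch ex As Exception
                    ' One failed file should not abort the whole batch.
                    Console.Error.WriteLine("Copy failed for " & pair.Key & ": " & ex.Message)
                End Try
            End Sub)
    End Sub
End Module
```

Copying between two NAS devices may well be limited by the network and the disks rather than by the CPU, so it is worth measuring a few values of maxWorkers instead of assuming that more threads are always faster.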

Code Pope
  • Won't running millions of copies in parallel have any repercussions or cause errors? – Throkar Mar 28 '19 at 15:33
  • You can set the maximal degree of parallel threads by `MaxDegreeOfParallelism`, but it is not necessary. It will automatically utilize however many threads the underlying scheduler provides. For more info you can read the documentation of `ParallelOptions.MaxDegreeOfParallelism`. Thus, in the case you have explained it should not generate any problem I think. I am using the same feature for generating documents in an application where up to 250 documents are created in parallel and it is working great. Test it out. – Code Pope Mar 28 '19 at 15:42
  • The [Task Parallel Library (TPL)](https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/task-parallel-library-tpl) does not run millions of copies in parallel. It is smart enough to create only a limited number of threads based on the number of processor cores. Then it distributes the (possibly millions of) tasks among these threads automatically. – Olivier Jacot-Descombes Mar 28 '19 at 21:06
  • Thank you, BUT as written the FileCompare function always gives an error! You must close FileStreams fs1 and fs2 before passing the file names to hashFileMD5. – AndruWitta Jun 06 '21 at 07:34