21

I'm writing a backup solution (of sorts). Put simply, it copies a file from C:\ to Z:\.

To keep it fast, before copying it checks whether the original file exists. If it does, it performs a few 'calculations' to work out whether the copy should continue or whether the backup file is already up to date. It is these calculations I'm finding difficult.

Originally, I compared the file sizes, but this is not good enough because a file can easily change and still be the same size (for example, saving the character C in Notepad gives the same size as saving the character T).

So, I need to find out if the modified date differs. At the moment, I get the file info using the FileInfo class but after reviewing all the fields there is nothing which appears to be suitable.

How can I check to ensure that I'm copying files which have been modified?

EDIT: I have seen suggestions on SO to use MD5 checksums, but I'm concerned this may be a problem, as some of the files I'm comparing will be up to 10 GB.

Steve Konves
Dave
  • 2
    There's that nice meta attribute that most file systems have, generally called "last modified time". – user703016 Oct 22 '12 at 15:33
  • But I don't get that from the FileInfo - I agree it is probably perfect but I don't know which class will provide me that information. – Dave Oct 22 '12 at 15:34
  • 1
    FileInfo.LastWriteTime doesn't have this information? That's the impression I got from this question: http://stackoverflow.com/questions/1185378/how-to-get-modified-date-from-file-in-c-sharp – JoshVarty Oct 22 '12 at 15:34
  • `FileInfo.LastWriteTime` doesn't help? How do you plan to handle changes to/from daylight saving time or any other clock adjustments? – HABO Oct 22 '12 at 15:35
  • I created a text file on my C:\ drive, I then copied it to my Z:\ drive 10 minutes later via my program. I then ran my program again and compared the LastWriteTime of the 2 files and they are different. – Dave Oct 22 '12 at 15:36
  • 1
    Perhaps this might help: http://stackoverflow.com/a/1358529/1220971 – Bridge Oct 22 '12 at 15:36
  • @Bridge - as per my last comment - it's going to be too slow for bigger files (although I appreciate it may be the answer if there is no other solution) – Dave Oct 22 '12 at 15:37
  • 1
    @DaveRook Some of the other answers on that question might be worth looking at then. :-) – Bridge Oct 22 '12 at 15:38
  • 1
    There's no other way to check if any byte in the file could have possibly been changed, except for comparing both files byte-by-byte which will probably be slower. – Mike Marynowski Oct 22 '12 at 15:44
  • @MikeMarynowski - Thank you; this answers a big part for me - now to continue with hash symbols - thank you. – Dave Oct 22 '12 at 15:46

6 Answers

28

Going by modified date will be unreliable - the computer clock can go backwards when it synchronizes, or when manually adjusted. Some programs might not behave well when modifying or copying files in terms of managing the modified date.

Going by the archive bit might work in a controlled environment but what happens if another piece of software is running that uses the archive bit as well?

The Windows archive bit is evil and must be stopped

If you want (almost) complete reliability then what you should do is store a hash value of the last backed up version using a good hashing function like SHA1, and if the hash value changes then you upload the new copy.

Here is the SHA1 class, with a code sample at the bottom of the page:

http://msdn.microsoft.com/en-us/library/system.security.cryptography.sha1.aspx

Just run the file bytes through it and store the hash value. Pass a FileStream to it instead of loading your file into memory with a byte array to reduce memory usage, especially for large files.
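
As a rough illustration of "run the file through it and store the hash value", here is a minimal sketch. The class name `BackupHashStore` and the `<backup>.sha1` sidecar file are assumptions made for this example only; a database or a hidden directory would work just as well.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class BackupHashStore
{
    // Streams the file through SHA-1 (no full in-memory copy) and returns the digest as hex.
    public static string ComputeHashHex(string path)
    {
        using (SHA1 sha1 = SHA1.Create())
        using (FileStream stream = File.OpenRead(path))
        {
            return BitConverter.ToString(sha1.ComputeHash(stream)).Replace("-", "");
        }
    }

    // Hypothetical layout: the hash of the last backed-up version lives in "<backup>.sha1".
    public static bool NeedsBackup(string sourcePath, string backupPath)
    {
        string hashFile = backupPath + ".sha1";
        if (!File.Exists(backupPath) || !File.Exists(hashFile))
            return true;

        string stored = File.ReadAllText(hashFile).Trim();
        return !string.Equals(stored, ComputeHashHex(sourcePath), StringComparison.OrdinalIgnoreCase);
    }

    // Copies the file and records the hash of the version that was copied.
    public static void Backup(string sourcePath, string backupPath)
    {
        string hash = ComputeHashHex(sourcePath);
        File.Copy(sourcePath, backupPath, true);
        File.WriteAllText(backupPath + ".sha1", hash);
    }
}
```

Only the source file is hashed on each run; the backup side is a small text read, which matters when the backup location is a slow network drive.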

You can combine this with the modified date in various ways to tune your program for speed and reliability. For example, you can check modified dates for most backups and periodically run a hash checker while the system is idle to make sure nothing got missed. Sometimes the modified date will change but the file contents stay the same (e.g. the file was overwritten with the same data), in which case you can avoid resending the whole file once you recompute the hash and see that it has not changed.
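
One way the cheap first pass could look is sketched below. The names `QuickCheck` and `MetadataDiffers` and the tolerance parameter are assumptions for illustration; the tolerance is there because file systems store timestamps at different resolutions, so exact equality may be too strict across drives.

```csharp
using System;
using System.IO;

static class QuickCheck
{
    // Returns true when size or last-write time suggests the backup may be stale.
    // A "false" result is only a hint: a periodic hash check is still needed to be sure.
    public static bool MetadataDiffers(string sourcePath, string backupPath, TimeSpan tolerance)
    {
        FileInfo source = new FileInfo(sourcePath);
        FileInfo backup = new FileInfo(backupPath);

        if (!backup.Exists)
            return true;

        if (source.Length != backup.Length)
            return true;

        // Compare in UTC so daylight saving changes do not affect the result.
        TimeSpan drift = (source.LastWriteTimeUtc - backup.LastWriteTimeUtc).Duration();
        return drift > tolerance;
    }
}
```

A caller might use something like `QuickCheck.MetadataDiffers(src, dst, TimeSpan.FromSeconds(2))` on every run and fall back to a full hash comparison on a schedule.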

Most version control systems use some kind of combined approach with hashes and modified dates.

Your approach will generally involve some kind of risk management with a compromise between performance and reliability if you don't want to do a full backup and send all the data over each time. It's important to do "full backups" once in a while for this reason.

Mike Marynowski
  • For clarity, when you say store the hash, do you mean in an external file or database (or the like)? – Dave Oct 22 '12 at 15:38
  • 3
    That depends on how your system is implemented :) You can keep a database of the values, or you can do what subversion used to do and create a hidden directory inside the backed up location that contains the hashes of all the files that got backed up. Subversion moved away from that and now keeps a database in a hidden directory only in the root of the versioned directory structure. – Mike Marynowski Oct 22 '12 at 15:40
  • I see - but this would rely on storing this data else where - interesting. Thank you for taking the time and helping. – Dave Oct 22 '12 at 15:47
  • 1
    This is fine for source code/documents but isn't really fast enough for large binaries etc. – Robbie Dee Oct 23 '12 at 09:58
  • 1
    Depends how you define "fast enough" - for a weekly or nightly unattended backup process, done during idle time, this can go through even 100GB of data in a reasonable amount of time. I do like the archive bit solution in a controlled environment, but I'd be wary of trusting it depending on where my backup process would be running. – Mike Marynowski Oct 24 '12 at 16:10
  • Check for the file size first, and if file size is the same, then calculate the checksum. – devunt Apr 25 '18 at 06:09
  • @devunt This would be unreliable. You could potentially change a file and still have the exact same file size. – Stephen P. May 06 '18 at 14:30
21

You can compare files by their hashes:

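private byte[] GetFileHash(string fileName)
{
    // SHA1.Create() makes the algorithm explicit; the parameterless
    // HashAlgorithm.Create() is not supported on newer .NET runtimes.
    using (HashAlgorithm sha1 = SHA1.Create())
    using (FileStream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read))
        return sha1.ComputeHash(stream);
}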

If the content has changed, the hashes will differ (barring an extremely unlikely collision).
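
The two byte arrays have to be compared element by element, because `==` on arrays only compares references. A minimal sketch, with `FileComparer` and `SameContent` as illustrative names and a length check added as a cheap early exit:

```csharp
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class FileComparer
{
    // Same idea as GetFileHash above: stream the file through SHA-1.
    static byte[] GetFileHash(string fileName)
    {
        using (SHA1 sha1 = SHA1.Create())
        using (FileStream stream = File.OpenRead(fileName))
            return sha1.ComputeHash(stream);
    }

    public static bool SameContent(string pathA, string pathB)
    {
        // Files of different lengths cannot have the same content, so skip hashing.
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
            return false;

        // SequenceEqual compares the hash bytes by value.
        return GetFileHash(pathA).SequenceEqual(GetFileHash(pathB));
    }
}
```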

Sergey Berezovskiy
12

You may like to check out the FileSystemWatcher class.

"This class lets you monitor a directory for changes and will fire an event when something is modified."

Your code can then handle the event and process the file.

Code source - MSDN:

// Create a new FileSystemWatcher and set its properties.
FileSystemWatcher watcher = new FileSystemWatcher();
watcher.Path = args[1];

/* Watch for changes in LastAccess and LastWrite times, and
   the renaming of files or directories. */
watcher.NotifyFilter = NotifyFilters.LastAccess | NotifyFilters.LastWrite
   | NotifyFilters.FileName | NotifyFilters.DirectoryName;

// Only watch text files.
watcher.Filter = "*.txt";

// Add event handlers.
watcher.Changed += new FileSystemEventHandler(OnChanged);
watcher.Created += new FileSystemEventHandler(OnChanged);
watcher.Deleted += new FileSystemEventHandler(OnChanged);
watcher.Renamed += new RenamedEventHandler(OnRenamed);

// Begin watching; no events are raised until this is set.
watcher.EnableRaisingEvents = true;
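
The snippet wires up `OnChanged` and `OnRenamed` without showing them. For a backup tool, one option is to have the handlers simply record which paths need copying later. The sketch below assumes that design; `ChangeTracker` and `PendingPaths` are hypothetical names, not part of the MSDN sample.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;

static class ChangeTracker
{
    // Paths waiting to be backed up. Events arrive on thread-pool threads,
    // so a concurrent collection is used. Case-insensitive to match Windows paths.
    public static readonly ConcurrentDictionary<string, bool> PendingPaths =
        new ConcurrentDictionary<string, bool>(StringComparer.OrdinalIgnoreCase);

    // Handles Changed, Created and Deleted events.
    public static void OnChanged(object source, FileSystemEventArgs e)
    {
        if (e.ChangeType == WatcherChangeTypes.Deleted)
        {
            bool ignored;
            PendingPaths.TryRemove(e.FullPath, out ignored);
        }
        else
        {
            PendingPaths[e.FullPath] = true;
        }
    }

    // A rename invalidates the old path and marks the new one for backup.
    public static void OnRenamed(object source, RenamedEventArgs e)
    {
        bool ignored;
        PendingPaths.TryRemove(e.OldFullPath, out ignored);
        PendingPaths[e.FullPath] = true;
    }
}
```

A single save can raise several Changed events for the same file, which is why the handlers only mark the path instead of copying it immediately.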
dsgriffin
  • 1
    My program is not designed to watch a folder 24/7, only check 2 files on the fly (at time of copy/paste). So +1 as this is good information and useful as an alternative but I'm looking to compare 2 files – Dave Oct 22 '12 at 15:35
  • 1
    FYI, this doesn't seem to be a Mono-compatible solution – joelc Jun 14 '16 at 21:35
  • I looked into the source code and saw that there is a while loop running continuously. Won't it keep the processor busy or add overhead? How does the OS manage this? – Omar Faroque Anik Mar 18 '19 at 18:26
  • @OmarFaroqueAnik - threading, the process lives in a separate thread and the OS handles that by choosing which threads to execute, what can be handled simultaneously and what can't, as well as decide what to execute at the point of i/o. – Paul Carlton Dec 24 '22 at 17:55
1

Generally speaking, you'd let the OS take care of tracking whether a file has changed or not.

If you use:

File.GetAttributes

And check for the archive flag; this will tell you whether the file has changed since it was last archived. I believe XCOPY (with its /M switch) and similar tools reset this flag once they have done the copy, but you may need to take care of this yourself.

You can easily test the flag in DOS using:

dir /aa yourfilename

Or just add the attributes column in Windows Explorer.

Robbie Dee
1

The file archive flag is normally used by backup programs to check whether a file needs backing up. When Windows modifies or creates a file, it sets the archive flag (see here). Check whether the archive flag is set to decide whether the file needs backing up:

if ((File.GetAttributes(fileName) & FileAttributes.Archive) == FileAttributes.Archive)
{
    // Archive file.
}

After backing up the file, clear the archive flag:

File.SetAttributes(fileName, File.GetAttributes(fileName) & ~FileAttributes.Archive);

This assumes no other programs (e.g., system backup software) are clearing the archive flag.
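
Put together, the check, copy and clear steps might look like the sketch below. `ArchiveFlag` and its members are illustrative names, and the archive attribute is a Windows file system concept, so this assumes the source file lives on a Windows volume.

```csharp
using System.IO;

static class ArchiveFlag
{
    // True when the archive bit is present in the given attributes.
    public static bool IsSet(FileAttributes attributes)
        => (attributes & FileAttributes.Archive) != 0;

    // Returns the attributes with the archive bit removed and all other bits untouched.
    public static FileAttributes Clear(FileAttributes attributes)
        => attributes & ~FileAttributes.Archive;

    // Copies the file only when its archive flag is set, then clears the flag.
    // Returns true if a copy was made.
    public static bool BackupIfFlagged(string sourcePath, string backupPath)
    {
        FileAttributes attributes = File.GetAttributes(sourcePath);
        if (!IsSet(attributes))
            return false;

        File.Copy(sourcePath, backupPath, true);
        File.SetAttributes(sourcePath, Clear(attributes));
        return true;
    }
}
```

The flag is cleared only after the copy succeeds, so a failed copy leaves the file marked for the next run.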

Polyfun
0

Get the Crc32 class from the article "Calculating CRC-32 in C# and .NET".

Pass your file path to the function below. It returns a CRC value; compare it with the CRC of the copy that already exists. If the CRCs are different, the file has changed.

internal Int32 GetCRC(string filepath)
{
    // Returns 0 if the file cannot be read. Note that 0 can also be a valid CRC,
    // so callers that need to tell the two apart should handle errors separately.
    StringBuilder hash = new StringBuilder();
    try
    {
        // Crc32 is the class from the article mentioned above.
        Crc32 crc32 = new Crc32();

        // "x2" already produces lowercase hex, so no extra ToLower() is needed.
        using (System.IO.FileStream fs = File.Open(filepath, FileMode.Open, FileAccess.Read, FileShare.None))
            foreach (byte b in crc32.ComputeHash(fs)) hash.Append(b.ToString("x2"));

        return Int32.Parse(hash.ToString(), System.Globalization.NumberStyles.HexNumber);
    }
    catch (Exception ex)
    {
        string msg = (ex.InnerException == null) ? ex.Message : ex.InnerException.Message;
        Console.WriteLine($"FILE ERROR: {msg}");

        return 0;
    }
}
ecklerpa