Background

I'm developing a simple Windows service which monitors certain directories for file creation events and logs them. Long story short, the goal is to ascertain whether a file was copied from directory A to directory B. If a file is not in directory B after X time, an alert will be raised.

The issue with this is I only have the file to go on for information when working out if it has made its way to directory B - I'd assume two files with the same name are the same, but as there are over 60 directory A's and a single directory B - AND the files in any directory A may accidentally be the same as another (by date or sequence) this is not a safe assumption...

Example

Let's say, for example, I store a log that file "E17999_XXX_2111.txt" was created in directory C:\Test. I would store the filename, file path, file creation date, file length and the BOM for this file.

30 seconds later, I detect that the file "E17999_XXX_2111.txt" was created in directory C:\FinalDestination... now I have the task of determining whether:

a) the file is the same one created in C:\Test, therefore I can update the first log as complete and stop worrying about it.

b) the file is not the same and I somehow missed the previous steps - therefore I can ignore this file because it has found its way to the destination dir.

Research

So, in order to determine if the file created in the destination is exactly the same as the one created in the first instance, I've done a bit of research and found the following options:

a) filename compare

b) length compare

c) a creation-date compare

d) byte-for-byte compare

e) hash compare

Problems

a) As I said above, going by filename alone is too presumptuous.

b) Again, just because the length of the contents of a file is the same, it doesn't necessarily mean the files are actually the same.

c) The problem with this is that a copied file is technically a new file, therefore the creation date changes. I would want to set the first log as complete regardless of the time elapsed between the file appearing in directory A and directory B.

d) Aside from the fact that this method is extremely slow, it appears there's an issue if the second file has somehow changed encoding, for example between ANSI and ASCII, which would cause a byte mismatch for things like ASCII quotes.

I would rather not treat a file as different just because an ASCII ' has changed to an ANSI ', as it is near enough the same.

e) This seems to have the same downfalls as a byte-for-byte compare
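Options (b) and (e) can be chained so that the cheap check rules out most mismatches before any file content is read. The following is only a minimal sketch under assumptions: the names `FileFingerprint`, `Sha256Hex` and `SameBytes` are made up for illustration, and SHA-256 is just one possible hash. It compares raw bytes, so it has exactly the encoding sensitivity described in (d) and (e).

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper, not part of the original service.
public static class FileFingerprint
{
    // Hex-encoded SHA-256 of the file's raw bytes, computed from a stream
    // so the whole file is never held in memory at once.
    public static string Sha256Hex(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(sha.ComputeHash(stream)).Replace("-", "");
        }
    }

    // Cheap check first (length), expensive check second (hash).
    public static bool SameBytes(string pathA, string pathB)
    {
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
        {
            return false;
        }
        return Sha256Hex(pathA) == Sha256Hex(pathB);
    }
}
```

One reason to prefer a hash over a byte-for-byte compare here is that the hash could be stored alongside the filename and length when the file is first logged, so the source file would not need to be read again when the copy turns up.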

EDIT

It appears the actual issue I'm experiencing comes down to the reason for the difference in encoding between directories - I'm not currently able to access the code which deals with this part, so I can't tell why this happens, but I am looking to implement a solution which can compare files regardless of encoding to determine "real" differences (i.e. not those whereby a byte has changed due to encoding)

SOLUTION

I've now resolved this. I first run the SequenceEqual comparison suggested by @Magnus; if that fails to find a match, I re-read both files, convert them to ASCII to strip out any bad data caused by the encoding difference, and compare again. Code below:

// Re-read both files and convert them to ASCII so that encoding-only differences are ignored.
byte[] bytes1 = Encoding.Convert(Encoding.GetEncoding(1252), Encoding.ASCII, Encoding.GetEncoding(1252).GetBytes(File.ReadAllText(FilePath1)));
byte[] bytes2 = Encoding.Convert(Encoding.GetEncoding(1252), Encoding.ASCII, Encoding.GetEncoding(1252).GetBytes(File.ReadAllText(FilePath2)));

if (Encoding.ASCII.GetChars(bytes1).SequenceEqual(Encoding.ASCII.GetChars(bytes2)))
{
    // matched!
}
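The same two-step idea (exact text compare first, lossy ASCII compare as a fallback) can be sketched without the Windows-1252 round trip. This is an illustrative variant, not the code I actually run: `LooseTextCompare`, `FoldToAscii` and `SameText` are made-up names. It relies on `Encoding.ASCII` replacing every character it cannot represent with `?`, and on `File.ReadAllText` decoding as UTF-8 (unless a BOM says otherwise) and substituting U+FFFD for bytes that are not valid UTF-8.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper illustrating the fallback comparison.
public static class LooseTextCompare
{
    // Replaces every non-ASCII character with '?', the ASCII encoder's
    // default replacement.
    public static string FoldToAscii(string text)
    {
        return Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(text));
    }

    public static bool SameText(string pathA, string pathB)
    {
        string a = File.ReadAllText(pathA);
        string b = File.ReadAllText(pathB);

        // Step 1: exact comparison of the decoded text.
        if (a == b)
        {
            return true;
        }

        // Step 2: lossy comparison that ignores which non-ASCII
        // character sits at each position.
        return FoldToAscii(a) == FoldToAscii(b);
    }
}
```

The fallback is deliberately loose: any two non-ASCII characters in the same position count as equal, so a curly quote and an accented letter would also "match". It also reads both files fully into memory, which is probably acceptable for small text files but not for large ones.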

Thanks for the help!

Danny Lager
  • Can't go into full detail right now, but, 'slow' is relative. The .net framework is pretty good at doing string comparisons, and you can convert them to native strings, which will be equatable. I'd go with A + B + D. In that order. Each one disqualifies later tests on failure. Read it as a string to do D – willaien Oct 20 '15 at 18:49
  • Is the changed encoding a real problem that would actually happen? And if so just make sure it doesn't. – Magnus Oct 20 '15 at 18:55
  • My current concept uses A + B + D and I assumed it was working fine until I came across the issue of encoding - I found that whilst a file had been created in directory A, the "matching" file copied into directory B was ever so slightly different - 3 bytes had changed into 1 byte, because an ASCII character quote had been converted to another type of quote by some sort of formatting. I'd like to still have matched these files - this is the real problem I'm facing :-) – Danny Lager Oct 20 '15 at 18:56
  • @Magnus, it did indeed happen, it was totally unexpected, I currently have no idea what caused it but as I'm attempting a one-size-fits-all solution to various code bases I was hoping to be able to come up with a solution that would resolve this regardless of encoding - I did research how to determine the encoding of a file and found this http://stackoverflow.com/a/19283954/5468452 but couldn't work out a way to then convert all files to the same format for the comparison... any tips? – Danny Lager Oct 20 '15 at 18:58
  • @DannyLager The `StreamReader` will auto detect the encoding used. – Magnus Oct 20 '15 at 19:05
  • @Magnus I've not personally attempted to use `StreamReader` to detect encoding but from what I've seen and as stated in the link above it appears it is not very accurate? – Danny Lager Oct 20 '15 at 19:09
  • Problem c: `IO.File.GetLastWriteTime` should be the same on both copies. – rheitzman Oct 20 '15 at 22:05

1 Answer

You would then have to compare the string content of the files. The StreamReader (which ReadLines uses) should detect the encoding from a byte order mark if one is present, and otherwise decodes the file as UTF-8.

// SequenceEqual is an extension method: requires "using System.Linq;"
var areEquals = System.IO.File.ReadLines("c:\\file1.txt").SequenceEqual(
                System.IO.File.ReadLines("c:\\file2.txt"));

Note that ReadLines will not read the complete file into memory.

Magnus
  • Thanks, I'll give this a go when possible, would this return true regardless of encoding as we are comparing two string literals or would it do the same as a byte-by-byte comparison? – Danny Lager Oct 20 '15 at 19:20
  • Just given this a try with the issue I'm experiencing, using UTF8 Encoding for both - `File.ReadLines(FilePath1, Encoding.UTF8).SequenceEqual(File.ReadLines(FilePath2, Encoding.UTF8))` - this is returning false, yet the only difference in the files is the quote so I assume this is still throwing it off... any suggestions on how to get around this? It was extremely quick running which is a positive... – Danny Lager Oct 21 '15 at 07:33
  • Perhaps the quote character is actually different and it is not an encoding issue. – Magnus Oct 21 '15 at 10:07
  • Turns out I had to re-read both files and then CONVERT them to ASCII encoding before doing the above comparison, if I found that the initial comparison failed. – Danny Lager Oct 21 '15 at 11:33