
Before you mark this question as a duplicate, please read what I wrote. I have checked many questions on a lot of pages for a solution but could not find anything. In my current application I was using this:

using (var md5 = MD5.Create())
{
    using (FileStream stream = File.OpenRead(FilePath))
    {
        var hash = md5.ComputeHash(stream);
        var cc = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        Console.WriteLine("Unique ID  : " + cc);
    }
}

This was working well enough for me with small files, but once I tried it with large files it took around 30-60 seconds to get the file ID.

I wonder if there is any other way to get something unique from a file, with or without using hashing or streams? My target machine is not always NTFS or Windows, so I have to find another way.

I was also wondering: does it make sense to read only the first "x" bytes from the stream and compute the hash for the unique ID from that reduced-size data?
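
Something like this untested sketch is what I have in mind; the 1 MB prefix size is just a guess, and I mix in the file length so two files with the same first bytes but different sizes still get different IDs:

using System;
using System.IO;
using System.Security.Cryptography;

static string GetQuickId(string filePath)
{
    const int PrefixSize = 1024 * 1024; // arbitrary guess: hash only the first 1 MB

    using (var md5 = MD5.Create())
    using (var stream = File.OpenRead(filePath))
    {
        var buffer = new byte[PrefixSize];
        int read = stream.Read(buffer, 0, buffer.Length);

        // Hash the prefix, then mix in the total length so two files with the
        // same first bytes but different sizes do not collide.
        md5.TransformBlock(buffer, 0, read, null, 0);
        byte[] lengthBytes = BitConverter.GetBytes(stream.Length);
        md5.TransformFinalBlock(lengthBytes, 0, lengthBytes.Length);

        return BitConverter.ToString(md5.Hash).Replace("-", "").ToLowerInvariant();
    }
}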

EDIT: It's not for security or anything like that; I need this unique ID because FileSystemWatcher is not working :)

EDIT2: Based on the comments I have decided to update my question. The reason I explain all this is that maybe there is a solution that is not based on creating unique IDs for files. My problem is that I have to watch a folder and fire events when there are: A) newly added files, B) changed files, C) deleted files.

The reason I can't use FileSystemWatcher is that it's not reliable. Sometimes I put 100 files into the folder and FileSystemWatcher only fires 20-30 events, and on a network drive it can be even fewer. My method was to save all the files and their unique IDs into a text file and check that index file every 5 seconds for changes. When there are no big files, like 18 GB ones, it works fine, but computing the hash of a 40 GB file takes far too long. My question is: how can I fire events when something happens in the folder I am watching?

EDIT3: After setting the bounty I realized I need to give more information about what's going on in my code. First, this is my answer to user @JustShadow (it was too long to send as a comment). I will explain how I do it: I save filepath-uniqueID (MD5 hash) pairs in a text file, and every 5 seconds I check the folder with Directory.GetFiles(DirectoryPath); then I compare the new list with the list I had 5 seconds ago, and this way I get 2 lists:

List<string> AddedList = FilesInFolder.Where(x => !OldList.Contains(x)).ToList();
List<string> RemovedList = OldList.Where(x => !FilesInFolder.Contains(x)).ToList();
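
(Side note: with many files the Contains lookups above get slow, because each one scans the whole list; an untested HashSet variant of the same diff:)

// Untested variant: HashSet lookups are O(1), so the diff is roughly linear.
var oldSet = new HashSet<string>(OldList);
var newSet = new HashSet<string>(FilesInFolder);

List<string> AddedList = FilesInFolder.Where(x => !oldSet.Contains(x)).ToList();
List<string> RemovedList = OldList.Where(x => !newSet.Contains(x)).ToList();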

That is how I get the two lists. Then come my if blocks:

if (AddedList.Count > 0 && RemovedList.Count == 0) then it's the easy case: no renames, only new files. I hash all the new files and add them to my text file.

if (AddedList.Count == 0 && RemovedList.Count > 0)

This is the opposite of the first if, and still simple: there are only removed items, so I remove them from the text file and I'm done. After these cases comes my else block, which is where I do the comparing. Basically I hash all AddedList items and look for hashes that exist in both lists. For example, if a.txt is renamed to b.txt, both lists' counts will be greater than zero, so the else block triggers. Inside the else I already know a.txt's hash value (it's inside the text file I created 5 seconds ago), so I compare it with all AddedList elements and see if I can match them. If I get a match, then it's a rename situation; if there is no match, then I can say b.txt was really newly added since the last scan.

Now I will also share some of my class code; maybe we can find a way to solve it once everyone knows what I'm actually doing. This is what my timer callback looks like:

private void TestTmr_Elapsed(object sender, System.Timers.ElapsedEventArgs e)
        {

            lock (locker)
            {
                if (string.IsNullOrWhiteSpace(FilePath))
                {
                    Console.WriteLine("Timer will be return because FilePath is empty. --> " + FilePath);
                    return;
                }
                try
                {
                    if (!File.Exists(FilePath + @"\index.MyIndexFile"))
                    {
                        Console.WriteLine("File not forund. Will be created now.");
                        FileStream close = File.Create(FilePath + @"\index.MyIndexFile");
                        close.Close();
                        return;
                    }

                    string EncryptedText = File.ReadAllText(FilePath + @"\index.MyIndexFile");
                    string JsonString = EncClass.Decrypt(EncryptedText, "SecretPassword");
                    CheckerModel obj = Newtonsoft.Json.JsonConvert.DeserializeObject<CheckerModel>(JsonString);
                    if (obj == null)
                    {
                        CheckerModel check = new CheckerModel();
                        FileInfo FI = new FileInfo(FilePath);
                        check.LastCheckTime = FI.LastAccessTime.ToString();
                        string JsonValue = Newtonsoft.Json.JsonConvert.SerializeObject(check);

                        if (!File.Exists(FilePath + @"\index.MyIndexFile"))
                        {
                            FileStream GG = File.Create(FilePath + @"\index.MyIndexFile");
                            GG.Close();
                        }

                        File.WriteAllText(FilePath + @"\index.MyIndexFile", EncClass.Encrypt(JsonValue, "SecretPassword"));
                        Console.WriteLine("DATA FILLED TO TEXT FILE");
                        obj = Newtonsoft.Json.JsonConvert.DeserializeObject<CheckerModel>(JsonValue);
                    }
                    DateTime LastAccess = Directory.GetLastAccessTime(FilePath);
                    string[] FilesInFolder = Directory.GetFiles(FilePath, "*.*", SearchOption.AllDirectories);
                    List<string> OldList = new List<string>(obj.Files.Select(z => z.Path).ToList());

                    List<string> AddedList = FilesInFolder.Where(x => !OldList.Contains(x)).ToList();
                    List<string> RemovedList = OldList.Where(x => !FilesInFolder.Contains(x)).ToList();


                    if (AddedList.Count == 0 && RemovedList.Count == 0)
                    {
                        //no changes.
                        Console.WriteLine("Nothing changed since last scan..!");
                    }
                    else if (AddedList.Count > 0 && RemovedList.Count == 0)
                    {
                        Console.WriteLine("Adding..");
                        //Files added but removedlist is empty which means they are not renamed. Fresh added..
                        List<System.Windows.Forms.ListViewItem> LvItems = new List<System.Windows.Forms.ListViewItem>();
                        for (int i = 0; i < AddedList.Count; i++)
                        {
                            LvItems.Add(new System.Windows.Forms.ListViewItem(AddedList[i] + " has added since last scan.."));
                            FileModel FileItem = new FileModel();
                            using (var md5 = MD5.Create())
                            {
                                using (FileStream stream = File.OpenRead(AddedList[i]))
                                {
                                    FileItem.Size = stream.Length.ToString();
                                    var hash = md5.ComputeHash(stream);
                                    FileItem.Id = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
                                }
                            }
                            FileItem.Name = Path.GetFileName(AddedList[i]);
                            FileItem.Path = AddedList[i];
                            obj.Files.Add(FileItem);
                        }
                    }
                    else if (AddedList.Count == 0 && RemovedList.Count > 0)
                    {
                        //Files removed and non has added which means files have deleted only. Not renamed.
                        for (int i = 0; i < RemovedList.Count; i++)
                        {
                            Console.WriteLine(RemovedList[i] + " has been removed from list since last scan..");
                            obj.Files.RemoveAll(x => x.Path == RemovedList[i]);
                        }
                    }
                    else
                    {
                        //Check for rename situations..

                        //Scan newly added files for MD5 ID's. If they are same with old one that means they are renamed.
                        //if a newly added file has a different MD5 ID that is not represented in old ones this file is fresh added.
                        for (int i = 0; i < AddedList.Count; i++)
                        {
                            string NewFileID = string.Empty;
                            string NewFileSize = string.Empty;
                            using (var md5 = MD5.Create())
                            {
                                using (FileStream stream = File.OpenRead(AddedList[i]))
                                {
                                    NewFileSize = stream.Length.ToString();
                                    var hash = md5.ComputeHash(stream);
                                    NewFileID = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
                                }
                            }
                            FileModel Result = obj.Files.FirstOrDefault(x => x.Id == NewFileID);
                            if (Result == null)
                            {
                                //Not a rename. It's a fresh file.
                                Console.WriteLine(AddedList[i] + " has been added since last scan..");
                                //Add the new file to the json list so it is tracked from now on.
                                FileModel FreshItem = new FileModel();
                                FreshItem.Id = NewFileID;
                                FreshItem.Name = Path.GetFileName(AddedList[i]);
                                FreshItem.Path = AddedList[i];
                                FreshItem.Size = NewFileSize;
                                obj.Files.Add(FreshItem);
                            }
                            else
                            {
                                Console.WriteLine(Result.Path + " has renamed into --> " + AddedList[i]);
                                //if file is replaced then it should be removed from RemovedList
                                RemovedList.RemoveAll(x => x == Result.Path);
                                obj.Files.Remove(Result);
                                //After removing old one add new one. This way new one will look like its renamed
                                FileModel ModelToadd = new FileModel();
                                ModelToadd.Id = NewFileID;
                                ModelToadd.Name = Path.GetFileName(AddedList[i]);
                                ModelToadd.Path = AddedList[i];
                                ModelToadd.Size = NewFileSize;
                                obj.Files.Add(ModelToadd);
                            }

                        }

                        //After handle AddedList we should also inform user for removed files 
                        for (int i = 0; i < RemovedList.Count; i++)
                        {
                            Console.WriteLine(RemovedList[i] + " has deleted since last scan.");
                        }
                    }

                    //Update Json after checking everything.
                    obj.LastCheckTime = LastAccess.ToString();
                    File.WriteAllText(FilePath + @"\index.MyIndexFile", EncClass.Encrypt(Newtonsoft.Json.JsonConvert.SerializeObject(obj), "SecretPassword"));


                }
                catch (Exception ex)
                {
                    Console.WriteLine("Error occurred --> " + ex.Message);
                }
                Console.WriteLine("----------- END OF SCAN ----------");
            }
        }
Shino Lex
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/189905/discussion-on-question-by-shino-lex-how-to-get-unique-file-identifier-from-a-fil). – elixenide Mar 12 '19 at 18:43
  • @shino-lex, I've updated my answer below. Please check that again. – Just Shadow Mar 15 '19 at 08:40
  • Did you try to follow [these recommendations](https://learn.microsoft.com/en-us/dotnet/api/system.io.filesystemwatcher?view=netframework-4.7.2#events-and-buffer-sizes) when using `FileSystemWatcher` ? – Spotted Mar 15 '19 at 08:58
  • @Spotted Yes, I have tried everything on the FileSystemWatcher side but still the same: I get the error "Too many changes at once in directory:xxx" even when I set FSW to catch only create events (not changes or anything else) with the maximum buffer. I just pasted 1 small txt file as a test (7-8 KB) and still got this error – Shino Lex Mar 15 '19 at 09:51
  • Can the files be renamed (outside of your application)? Can a file change more than once in a second? I'm thinking of a combination of both a filename and a last-modified date to detect changes. If the prerequisites are met I will post some code involving this. – Spotted Mar 15 '19 at 10:06
  • AFAIK taking the last-modified date does not work very well over a network. The video file will be created and the recording will continue until the user stops it; then the file will be closed. Once it's closed I have to fire a create event for this file, which will flag that video file as ready to use. I also have to fire delete and rename events for it. There is no chance they can append to the video file, so I can say I need create, rename (file rename) and delete events to be working in my case – Shino Lex Mar 15 '19 at 10:14
  • It's surprising that FSW with enough buffer size configured misses things. The way FSW events should be used is that you only store the events, e.g. in a thread-safe list, and process them in another thread (thread pool, etc.); a rough sketch of that pattern follows these comments. Also, you say you're not always on Windows, so how can you even think about using FSW? – Simon Mourier Mar 15 '19 at 14:46
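
A rough sketch of the enqueue-then-process pattern suggested in the last comment (untested; the folder path is a placeholder and only the Created event is wired up for brevity):

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

class WatcherQueueDemo
{
    static void Main()
    {
        // The watcher callback only enqueues; a worker thread does the slow
        // processing, so the FSW internal buffer is drained as fast as possible.
        var queue = new BlockingCollection<FileSystemEventArgs>();
        var fsw = new FileSystemWatcher(@"C:\WatchedFolder"); // placeholder path
        fsw.Created += (s, e) => queue.Add(e);
        fsw.EnableRaisingEvents = true;

        var worker = new Thread(() =>
        {
            foreach (var e in queue.GetConsumingEnumerable())
                Console.WriteLine($"Processing {e.ChangeType}: {e.FullPath}"); // slow work goes here
        });
        worker.IsBackground = true;
        worker.Start();

        Console.ReadKey();
    }
}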

2 Answers


As to your approach:

  1. No guarantee exists that checksum collisions (cryptographic or not) can be avoided, no matter how unlikely they are.
  2. The more of a file you process, the less likely a collision becomes.
  3. The I/O of continually reading files in full is incredibly expensive.
  4. Windows knows when files are changing, so it's best to use the provided monitoring mechanism.

FileSystemWatcher has a buffer; its default size is 8192 bytes, the minimum 4 KB, the maximum 64 KB. When events are missed it is typically (in my experience, at least) because the buffer size is too small. Example code follows. In my test I dropped 296 files into an (empty) C:\Temp folder. Every copy resulted in 3 events; none were missed.

using System;
using System.IO;
using System.Threading;

namespace FileSystemWatcherDemo
{
  class Program
  {
    private static volatile int Count = 0;
    private static FileSystemWatcher Fsw = new FileSystemWatcher
    {
      InternalBufferSize = 48 * 1024,  //  default 8192 bytes, min 4KB, max 64KB
      EnableRaisingEvents = false
    };
    private static void MonitorFolder(string path)
    {
      Fsw.Path = path;
      Fsw.Created += FSW_Add;
      Fsw.Changed += FSW_Chg;
      Fsw.Deleted += FSW_Del;
      Fsw.EnableRaisingEvents = true;
    }

    private static void FSW_Add(object sender, FileSystemEventArgs e) { Console.WriteLine($"ADD: {++Count} {e.Name}"); }
    private static void FSW_Chg(object sender, FileSystemEventArgs e) { Console.WriteLine($"CHG: {++Count} {e.Name}"); }
    private static void FSW_Del(object sender, FileSystemEventArgs e) { Console.WriteLine($"DEL: {++Count} {e.Name}"); }
    static void Main(string[] args)
    {
      MonitorFolder(@"C:\Temp\");
      while (true)
      {
        Thread.Sleep(500);
        if (Console.KeyAvailable) break;
      }
      Console.ReadKey();  //  clear buffered keystroke
      Fsw.EnableRaisingEvents = false;
      Console.WriteLine($"{Count} file changes detected");
      Console.ReadKey();
    }
  }
}

Results

ADD: 880 tmpF780.tmp
CHG: 881 tmpF780.tmp
DEL: 882 tmpF780.tmp
ADD: 883 vminst.log
CHG: 884 vminst.log
DEL: 885 vminst.log
ADD: 886 VSIXbpo3w5n5.vsix
CHG: 887 VSIXbpo3w5n5.vsix
DEL: 888 VSIXbpo3w5n5.vsix
888 file changes detected
AlanK
  • My target machine is a NAS (network-attached storage) and the provider uses its own Linux file system, so FileSystemWatcher throws the "Too many changes at once in directory:xxx" error even if I set the buffer to the maximum size and subscribe only to the create event. – Shino Lex Mar 15 '19 at 09:54
  • @ShinoLex Then I suggest checking some cheap-to-access properties (like modification time and size) and, when anything seems to have changed, using an expensive-to-calculate property (like MD5) to confirm. No matter how good the hash, it is never guaranteed unique (which is why file compare utilities offer timestamp, CRC etc. and, as a last resort, binary comparison). If many hundreds of files can change in a very short space of time, then reading every file every 5 seconds to calculate an MD5 is likely to kill your network and/or NAS performance anyway. – AlanK Mar 15 '19 at 11:10
  • Okay, then I will use a hybrid system, but I still can't figure out how to do this. Should I first take both files' CRC results and compare them? If they are the same, then I should go with file sizes, and if they are STILL the same, then I should go with MD5 hashing to make sure they are the same file? Will this work better? I'm not sure yet (I just thought about this while writing this comment), but maybe I can set something in the file properties? Like video file metadata? Can I access and give a unique number to the file I have scanned? So I can reach that property on the next scan and check it? – Shino Lex Mar 15 '19 at 11:16
  • If the file size changes, the file has changed, so there is no need for a hash/checksum; just deal with the change. Otherwise, if the timestamp changes (same size) the file contents *might* have changed, and only then do you confirm with a checksum/hash; a rough sketch of this two-stage check follows these comments. As far as I know, image/video tags are stored inside the file, so you would be changing the file in order to monitor it. Keeping a local dictionary to compare against is probably the way to go. – AlanK Mar 15 '19 at 11:49
  • But I need something that will stick with that file. If I use some sort of dictionary and make a table myself then I can't match renames. Lets say there was 1.txt after 5 second(next scan) there is no more 1.txt instead there is 2.txt, now in order to know if its renamed(1.txt-->2.txt) or 1.txt removed and a new file named 2.txt added I need something that comes within the file? am I right here? correct me if I'm wrong please – Shino Lex Mar 15 '19 at 12:04
  • If you mean "stick with that file's contents" then you are right (i.e. a rename is not a file change). Another issue with using a checksum of contents as a unique identifier is distinguishing between two files with the same contents (and therefore the same checksum). I don't know any way of editing the remote file system's properties for a file, but perhaps you can find a NAS that provides an audit trail of file system activity that you can monitor? – AlanK Mar 15 '19 at 12:30
  • Unfortunately I'm not able to tell which NAS to purchase to customer, why is renaming not a file change event? Does that mean there is another ways to detect rename situations? Because if there is I can handle add and remove events with my current code and change my else block(Rename block) – Shino Lex Mar 15 '19 at 12:39
  • A checksum/hash of the file's contents will not change when the file is renamed. When checking the folder every few seconds/minutes, a rename would not be distinguishable from a move (delete+create). To detect content changes you might need a very expensive checksum/hash but to detect delete+create/move/rename you just need to detect one name disappearing and another one appearing. – AlanK Mar 15 '19 at 13:14
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/190093/discussion-between-shino-lex-and-alank). – Shino Lex Mar 15 '19 at 13:37
  • Guys, though "no guarantee exists that checksum (cryptographic or non) collisions can be avoided", the probability that you might have 2 different files with the same hash is EXTREMELY low: https://crypto.stackexchange.com/a/18337, unless you intentionally generate a file with the same hash. Moreover, such a generated file would be just a bunch of bytes with no useful content. Thus, the chance that you might have 2 files with the same hash AND valid content is super-extremely low. So you can go with such algorithms without worrying about that. – Just Shadow Mar 17 '19 at 16:45
  • @JustShadow The OP asked specifically "I was wondering if it makes sense if I just get the first "x" amount of bytes from the stream and do the hashing for unique ID with that lowered-size stream?". The message I've tried to convey is that full/proper hashing is expensive and shortcut hashing increases the likelihood of collision and/or of missing meaningful changes that occur beyond the x-bytes cut-off point. This is not a crypto problem, the OP wants to be able to detect any change - either to content and/or a change of filename (without change of contents). – AlanK Mar 18 '19 at 08:32
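
A rough, untested sketch of the two-stage check described in the comments above (size and timestamp first, MD5 only to confirm; KnownFile and HasChanged are hypothetical names, not part of the original code):

using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical record of what we knew about a file at the last scan.
class KnownFile
{
    public long Size;
    public DateTime LastWriteUtc;
    public string Md5;
}

static class ChangeCheck
{
    // Size differs: definitely changed. Size and timestamp both unchanged:
    // assume unchanged. Otherwise confirm with the expensive MD5.
    public static bool HasChanged(string path, KnownFile known)
    {
        var info = new FileInfo(path);
        if (info.Length != known.Size) return true;
        if (info.LastWriteTimeUtc == known.LastWriteUtc) return false;

        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            var hash = BitConverter.ToString(md5.ComputeHash(stream))
                                   .Replace("-", "").ToLowerInvariant();
            return hash != known.Md5;
        }
    }
}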

You might consider using CRC checksums instead, which are much faster to compute.
Here is how to calculate a CRC64 checksum in C#:

// Note: Crc64 is not in the .NET BCL; this assumes a HashAlgorithm-based
// implementation such as the DamienGKit Crc64 class linked in the comments below.
Crc64 crc64 = new Crc64();
string hash = string.Empty;

using (FileStream fs = File.Open("c:\\myBigFile.raw", FileMode.Open))
{
    foreach (byte b in crc64.ComputeHash(fs))
        hash += b.ToString("x2").ToLower();
}

Console.WriteLine("CRC-64 is {0}", hash);

This calculated the checksum of my 4 GB file within a few seconds.

Note:
Checksums are not as collision-resistant as hashes like MD5/SHA/....
So, if you have many files, you might consider a hybrid solution of checksums and hashes: calculate the cheap checksum first, and only when two checksums match, calculate MD5 to make sure whether the files really are the same.
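
A minimal sketch of that hybrid idea (untested; it assumes a HashAlgorithm-based Crc64 implementation such as the DamienGKit class linked in the comments below, and SameContent/Checksum are hypothetical helper names):

using System;
using System.IO;
using System.Security.Cryptography;

static class ContentCompare
{
    static string Checksum(string path, HashAlgorithm algorithm)
    {
        using (algorithm)
        using (var fs = File.OpenRead(path))
            return BitConverter.ToString(algorithm.ComputeHash(fs))
                               .Replace("-", "").ToLowerInvariant();
    }

    public static bool SameContent(string pathA, string pathB)
    {
        // Cheap check first: different CRC-64 values mean different contents.
        if (Checksum(pathA, new Crc64()) != Checksum(pathB, new Crc64()))
            return false;

        // CRCs match: confirm with MD5, which collides far less often.
        return Checksum(pathA, MD5.Create()) == Checksum(pathB, MD5.Create());
    }
}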

P.S. Also check this answer for more info about checksums vs usual hashcodes.

Just Shadow
  • I will edit my post and will write some of my code. Maybe we can figure it out somehow with the current code? – Shino Lex Mar 15 '19 at 10:25
  • Reading a 40 GB file is incredibly slow regardless of the algorithm you use to calculate the content based checksum. – dymanoid Mar 15 '19 at 10:28
  • Agreed. Reading a 40 GB file from storage is surely a slow operation. But here, by saying much faster, I'm talking about the calculation speed of CRC versus MD5/SHA/... – Just Shadow Mar 15 '19 at 11:06
  • Can I use something that will stick to the file? Like metadata or a property? I mean the properties-details page properties. This will indeed change the file size, but then I would have a real unique ID for each of my files? – Shino Lex Mar 15 '19 at 12:05
  • @ShinoLex The answer to your question depends on the file format you have. If you have images like jpg and png, this would work. Simply generate a GUID and attach it to the end of those files, and then every time, instead of checking hashes, check the last bytes where you have the GUID. For music, adding data to the end of the file might work, but NOT always, since it depends on the codecs and format. So, practically, yes, you can do that. Just make sure your files are still readable after that. – Just Shadow Mar 17 '19 at 17:16
  • Downvoters, please describe the reason for downvoting so I can improve my answer. Or just restore the vote if I've addressed your concerns in the comments. – Just Shadow Mar 17 '19 at 17:18
  • Thank you for your comment @JustShadow, I can clearly see what I can do now. I don't need to find a unique file identifier for my files. I will simply rename them and save them to my index file with the renamed values. Since I send the filename to our program, I will send the file name with the unique number at the end, and the display name will remain unchanged. I will use a unique delimiter string to make sure I can strip the numbers from the display name. I can't believe I didn't think about this earlier. Now I just need to make changes in the main product for the extended unique number at the end of the file – Shino Lex Mar 18 '19 at 06:56
  • @JustShadow I wanted to thank you again for the approach you showed me. It enlightened me. I have added my unique ID to the end of the files and report each file with my own unique ID, sending the display name without the ID. Now I can clearly separate whether a file was added, deleted or renamed. This might not be a perfect solution, but for now it's working. At least for me. – Shino Lex Mar 21 '19 at 08:52
  • @JustShadow What library is Crc64 in? – Gina Marano Jan 15 '21 at 04:59
  • There are many, but [this](https://github.com/damieng/DamienGKit/blob/master/CSharp/DamienG.Library/Security/Cryptography/Crc64.cs) one is the simplest. – Just Shadow Jan 15 '21 at 08:57