I have written a web application in C#. One of its features is file attachment, which accepts videos, photos, documents, and so on. Some of these files are duplicates and some are large, so over time they take up a lot of space. I want the program to be smarter: it should recognize duplicate files, reuse the previously stored copy, and even report on frequently used files.

For example, I wrote an extension method that combines the file content and the file's MIME type to create a unique string:

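using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class FileKeyExtensions
{
    // Builds a key from the file content and its MIME type.
    public static string CreateFileKey(this Stream file, string mimeType)
    {
        if (file is null)
            throw new ArgumentNullException(nameof(file));
        if (file.Length == 0)
            throw new ArgumentException("The file is empty.", nameof(file));
        if (string.IsNullOrWhiteSpace(mimeType))
            throw new ArgumentException("The MIME type is required.", nameof(mimeType));

        file.Seek(0, SeekOrigin.Begin);
        using var hashAlgorithm = MD5.Create();
        // ComputeHash reads the stream in chunks and leaves it open for the caller.
        var hashedFile = hashAlgorithm.ComputeHash(file);

        var mimeTypeBytes = Encoding.ASCII.GetBytes(mimeType);
        var trustedDataForHashing = mimeTypeBytes.Concat(hashedFile).ToArray();

        // Hash the MIME type together with the content hash to get the final key.
        var result = hashAlgorithm.ComputeHash(trustedDataForHashing);
        return Convert.ToBase64String(result);
    }
}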

Now I first check whether a file has already been saved under this key, and only then decide whether to save the new file.
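
A minimal sketch of that check, assuming a hypothetical in-memory `ContentStore` (the class and method names are illustrative only; a real application would more likely use a database table with a unique index on the key):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory store that keeps one copy of the content per key.
public sealed class ContentStore
{
    private readonly Dictionary<string, byte[]> _contentByKey =
        new Dictionary<string, byte[]>();

    // Number of distinct contents currently stored.
    public int Count => _contentByKey.Count;

    // Stores the content and returns true if the key is new;
    // returns false and stores nothing if the key is already known.
    public bool SaveIfNew(string key, byte[] content)
    {
        if (key is null)
            throw new ArgumentNullException(nameof(key));
        return _contentByKey.TryAdd(key, content);
    }
}
```

Calling `SaveIfNew` twice with the same key stores the content only once; the second call returns `false`, which is the signal to reuse the existing copy.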

Is using a hash algorithm a good way to generate a unique value for each file?

BQF
  • One of _which_ hash algorithms? – Mathias R. Jessen Aug 28 '23 at 14:26
  • Please [edit] your question so that it asks a single question. Note that the first question depends too much on opinion to fit the Stack Overflow model. I would just ask the second question, perhaps being more specific about what requirements you have for the algorithm. People will let you know if you're on the wrong track. – Heretic Monkey Aug 28 '23 at 14:26
  • Hash values will always be susceptible to collisions (i.e. two different files producing the same hash). You can improve duplicate detection by computing two hashes, and of course also comparing file lengths, before resorting to a byte-by-byte comparison – d-markey Aug 28 '23 at 14:27
  • Expand your idea to a ["Bloom filter"](https://en.wikipedia.org/wiki/Bloom_filter) – Alexey S. Larionov Aug 29 '23 at 10:01
  • @AlexeyS.Larionov thanks – BQF Aug 29 '23 at 10:23

1 Answer

Yes, it's a good idea to use the lightweight MD5 hash algorithm, or one of the stronger hashing algorithms like SHA512, to create a signature for each file. If two signatures differ, the files are certainly different; if the signatures match, the files are almost certainly, but not provably, identical. There is always a tiny but non-zero probability that two different files will generate the same signature. You might have to take this probability into account if, for example, the files contain sensitive private information that must not be shared between users.
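
When two signatures match and a false positive is unacceptable, the match can be confirmed by comparing the contents directly. The following is only a sketch under stated assumptions: `StreamComparer` and `HaveSameContent` are hypothetical names, and both streams are assumed to be seekable.

```csharp
using System;
using System.IO;

public static class StreamComparer
{
    // Returns true only if both seekable streams hold exactly the same bytes.
    public static bool HaveSameContent(Stream first, Stream second)
    {
        if (first is null) throw new ArgumentNullException(nameof(first));
        if (second is null) throw new ArgumentNullException(nameof(second));

        // Cheap check first: different lengths can never be identical.
        if (first.Length != second.Length)
            return false;

        first.Seek(0, SeekOrigin.Begin);
        second.Seek(0, SeekOrigin.Begin);

        // The buffered wrappers are intentionally not disposed,
        // so the caller's streams stay open.
        var a = new BufferedStream(first);
        var b = new BufferedStream(second);

        int value;
        do
        {
            value = a.ReadByte();
            if (value != b.ReadByte())
                return false;
        } while (value != -1); // -1 means both streams reached the end

        return true;
    }
}
```

The length check is done first because it is far cheaper than reading either stream; the byte-by-byte pass only runs for candidates that already share a signature.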

You might also have to think about empty files. You might have to maintain multiple empty files with different filenames. All of these files will obviously have the same signature, because the signature depends only on their content.
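
One possible way to handle this, sketched here as an assumption rather than a recommendation, is to keep filenames separate from content: each attachment records its own name and points to a shared content key. The `AttachmentIndex` class below is hypothetical, and the same bookkeeping gives a usage count per content.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical index: many filenames may point to one content key.
public sealed class AttachmentIndex
{
    private readonly Dictionary<string, List<string>> _namesByKey =
        new Dictionary<string, List<string>>();

    // Records that an attachment with this filename uses the given content.
    public void Add(string fileName, string contentKey)
    {
        if (!_namesByKey.TryGetValue(contentKey, out var names))
        {
            names = new List<string>();
            _namesByKey[contentKey] = names;
        }
        names.Add(fileName);
    }

    // How many attachments share this content (0 if the key is unknown).
    public int UsageCount(string contentKey) =>
        _namesByKey.TryGetValue(contentKey, out var names) ? names.Count : 0;
}
```

With this layout, two empty files named differently become two filename entries under one key, and `UsageCount` is a starting point for reporting frequently used files.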

Theodor Zoulias
  • If I throw an exception for an empty file, doesn't that solve it? – BQF Aug 28 '23 at 14:57
  • @BQF it depends on who will handle the exception, and how. In general controlling the execution flow with exceptions [is not recommended](https://stackoverflow.com/questions/729379/why-not-use-exceptions-as-regular-flow-of-control "Why not use exceptions as regular flow of control?"). – Theodor Zoulias Aug 28 '23 at 15:01
  • I meant data validation on the user side; in the server code, a custom error must certainly be raised if invalid data reaches the server. Using exceptions this way is quite common and, in my opinion, causes no problems: at the highest level of the software layers, we can catch the exceptions and handle them. – BQF Aug 29 '23 at 05:54
  • @BQF well, if exceptions work in your case, go for it. This means that your users won't be able to attach empty files to their emails (or whatever the main entity is). – Theodor Zoulias Aug 29 '23 at 06:23