I am relatively new to C#. I am currently learning about HashSet and understand that a HashSet does not allow duplicate elements. My question is: can I hash the files in my directory with MD5 and store the hashes in a HashSet, so that I can check for duplicate hashes and, in that way, duplicate files?
-
@Viv there are n methods for encryption and decryption in C#. Post a snippet of your code if you get stuck anywhere – Udal Pal Aug 14 '19 at 06:35
-
Beware of [pigeonholes](https://blog.codinghorror.com/hashtables-pigeonholes-and-birthdays/). Also, a hashset and MD5 are similar but different. Checking for equality is a two-step workflow. First, you check if the hashes of two things are equal. If they are not, you can be sure the things are different. But if they are, you can **NOT** be sure that the things are equal. For that, you would need to perform a deeper equality check. For files, you'd probably have to compare each byte. So hashing is used to quickly weed out a lot of definite "not equals". – Corak Aug 14 '19 at 06:47
-
See also [Guidelines and rules for GetHashCode](https://blogs.msdn.microsoft.com/ericlippert/2011/02/28/guidelines-and-rules-for-gethashcode/) – Corak Aug 14 '19 at 06:47
-
@John MD5 produces a 128-bit hash value (the same size as a GUID), so it would be extremely unlikely that two different files would generate the same hash. – Magnus Aug 14 '19 at 08:39
-
@Magnus Indeed it's unlikely, but it's an important distinction, especially if OP applies the same logic in future to hash functions that produce smaller outputs. – ProgrammingLlama Aug 14 '19 at 08:42
-
@John so unlikely that if processing 6 billion files per second for 100 years you might have a collision. – Magnus Aug 14 '19 at 08:45
1 Answer
The `HashSet<T>` class is a collection of unique elements. It lives in the `System.Collections.Generic` namespace and was introduced in .NET Framework 3.5.
Let's take an example with files:
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Security.Cryptography;

    class Program
    {
        static void Main(string[] args)
        {
            HashSet<string> FileData = new HashSet<string>();
            // Document.txt is listed twice on purpose, to show that the
            // second, identical hash is not stored again.
            string[] paths =
            {
                "C:\\FolderTest\\Document.txt",
                "C:\\FolderTest\\Document.txt",
                "C:\\FolderTest\\Document2.txt"
            };
            using (var md5 = MD5.Create())
            {
                foreach (var path in paths)
                {
                    using (var stream = File.OpenRead(path))
                    {
                        var hash = md5.ComputeHash(stream);
                        // Convert the 16 hash bytes to a lowercase hex string.
                        var data = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
                        FileData.Add(data);
                    }
                }
            }
            foreach (var file in FileData)
            {
                Console.WriteLine(file);
            }
            Console.ReadKey();
        }
    }
The code above creates a `HashSet<string>` and adds the MD5 hash of each file to it as a hex string.
Even though we try to add the hash of Document.txt twice, we do not get any error: `Add` simply returns `false` for the duplicate and leaves the set unchanged. When we iterate the collection, that hash appears only once.
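To apply this to a whole directory, as the question asks, the return value of `Add` can be used to spot repeated hashes. The following is only a minimal sketch: the class name `DuplicateFinder`, the method name `FindDuplicateCandidates` and the default folder are made up for illustration, and it only looks at files directly inside the folder. As the comments under the question point out, a matching hash marks a candidate; a byte-by-byte comparison would be needed to be completely sure.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

static class DuplicateFinder
{
    // Returns the paths whose MD5 hash was already seen earlier in the scan.
    // (Hypothetical helper, not part of any library.)
    public static List<string> FindDuplicateCandidates(string directory)
    {
        var seen = new HashSet<string>();
        var duplicates = new List<string>();
        using (var md5 = MD5.Create())
        {
            foreach (var path in Directory.EnumerateFiles(directory))
            {
                string hex;
                using (var stream = File.OpenRead(path))
                {
                    hex = BitConverter.ToString(md5.ComputeHash(stream));
                }
                // HashSet<T>.Add returns false when the element is already present.
                if (!seen.Add(hex))
                {
                    duplicates.Add(path);
                }
            }
        }
        return duplicates;
    }

    static void Main(string[] args)
    {
        // Assumed default folder, taken from the example above.
        var directory = args.Length > 0 ? args[0] : @"C:\FolderTest";
        foreach (var path in FindDuplicateCandidates(directory))
        {
            Console.WriteLine("Possible duplicate: " + path);
        }
    }
}
```

Which of two identical files gets reported depends on the order in which the file system enumerates them, so the result should be read as "this file repeats an earlier one", not as a judgment about which copy is the original.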
How you compare the results afterward is up to you: you can convert the byte array to Base64, for example, or compare the bytes directly. Just be aware that arrays don't override `Equals`, so two arrays with identical contents are not considered equal by default. Using Base64 is simpler to get right, but slightly less efficient if you're really only interested in comparing the hashes.
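The point about arrays is easy to trip over, so here is a small sketch of it. The class `HashComparison` and the helper `Md5Of` are hypothetical names, and the hash is computed over an in-memory string rather than a file just to keep the example self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static class HashComparison
{
    // Hypothetical helper: MD5 of a string's UTF-8 bytes.
    public static byte[] Md5Of(string text)
    {
        using (var md5 = MD5.Create())
        {
            return md5.ComputeHash(Encoding.UTF8.GetBytes(text));
        }
    }

    static void Main()
    {
        byte[] first = Md5Of("same content");
        byte[] second = Md5Of("same content");

        // Arrays use reference equality, so this prints False.
        Console.WriteLine(first.Equals(second));
        // SequenceEqual compares element by element, so this prints True.
        Console.WriteLine(first.SequenceEqual(second));

        // A HashSet<byte[]> with the default comparer keeps both arrays.
        var raw = new HashSet<byte[]> { first, second };
        Console.WriteLine(raw.Count); // 2

        // Strings compare by value, so the Base64 form is deduplicated.
        var encoded = new HashSet<string>
        {
            Convert.ToBase64String(first),
            Convert.ToBase64String(second)
        };
        Console.WriteLine(encoded.Count); // 1
    }
}
```

This is why the example above stores strings and not raw `byte[]` values; storing the arrays directly would need a custom `IEqualityComparer<byte[]>`.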
Characteristics of HashSet:
- When we add elements, `HashSet<T>` automatically increases its capacity.
- It is used in situations where we want to prevent duplicates from being inserted into the collection.
- HashSet provides many mathematical set operations, such as set addition (union) and set subtraction.
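
As a rough illustration of the set operations mentioned in the last point, the sketch below compares two sets of hash strings. The class `SetOperationsDemo`, its method names and the placeholder values such as `"hash1"` are invented for the example; in practice the sets would hold real file hashes. Note that `UnionWith`, `ExceptWith` and `IntersectWith` modify the set they are called on, so each helper works on a copy.

```csharp
using System;
using System.Collections.Generic;

static class SetOperationsDemo
{
    // Each helper copies the first set so the inputs stay unchanged.
    public static HashSet<string> Union(HashSet<string> a, HashSet<string> b)
    {
        var result = new HashSet<string>(a);
        result.UnionWith(b);
        return result;
    }

    public static HashSet<string> Difference(HashSet<string> a, HashSet<string> b)
    {
        var result = new HashSet<string>(a);
        result.ExceptWith(b);
        return result;
    }

    public static HashSet<string> Intersection(HashSet<string> a, HashSet<string> b)
    {
        var result = new HashSet<string>(a);
        result.IntersectWith(b);
        return result;
    }

    static void Main()
    {
        // Placeholder values standing in for the hashes of two folders.
        var folderA = new HashSet<string> { "hash1", "hash2", "hash3" };
        var folderB = new HashSet<string> { "hash3", "hash4" };

        Console.WriteLine(Union(folderA, folderB).Count);        // 4
        Console.WriteLine(Difference(folderA, folderB).Count);   // 2 (only in A)
        Console.WriteLine(Intersection(folderA, folderB).Count); // 1 (in both)
    }
}
```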
