I am building an app that scans files by comparing hashes. I need to search over 1GB of hashes for the hash of a file. I have tried other approaches, such as Aho-Corasick, but they were slower than File.ReadLines(file).Contains(str).

This is the fastest code I have so far, using File.ReadLines. It takes about 8 seconds to scan one file, versus around 2 minutes per file with Aho-Corasick. I cannot read the entire hash file into memory for obvious reasons.
IEnumerable<DirectoryInfo> directories = new DirectoryInfo(scanPath).EnumerateDirectories();
IEnumerable<FileInfo> files = new DirectoryInfo(scanPath).EnumerateFiles();
FileInfo hashes = new FileInfo(hashPath);
await Task.Run(() =>
{
    IEnumerable<string> lines = File.ReadLines(hashes.FullName);
    foreach (FileInfo file in files)
    {
        if (!AuthenticodeTools.IsTrusted(file.FullName))
        {
            string hash = getHash(file.FullName);
            if (lines.Contains(hash)) flaggedFiles.Add(file.FullName);
        }
        filesScanned += 1;
    }
});
foreach (DirectoryInfo directory in directories)
{
await scan(directory.FullName, hashPath);
directoriesScanned += 1;
}
Edit: Per request, here are examples of the file's content:
5c269c9ec0255bbd9f4e20420233b1a7
63510b1eea36a23b3520e2b39c35ef4e
0955924ebc1876f0b849b3b9e45ed49d
They are MD5 hashes.
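The getHash helper is not shown above; a minimal sketch of what it might look like, assuming it returns the lowercase hex MD5 of a file (via System.Security.Cryptography) so that it matches the format of the hash list entries:

using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical sketch of getHash (the real implementation is not part of the
// question). Streams the file through MD5 and returns the digest as a
// lowercase hex string, matching the lines in the hash file.
static string getHash(string path)
{
    using (MD5 md5 = MD5.Create())
    using (FileStream stream = File.OpenRead(path))
    {
        byte[] digest = md5.ComputeHash(stream);
        return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
    }
}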