1

I am building an app that scans files by comparing hashes. I need to search over 1GB of hashes for the hash of a file. I found other solutions for this, such as Aho-Corasick, but they were slower than `File.ReadLines(file).Contains(str)`.

This is the code that is the fastest so far, using File.ReadLines. It takes about 8 seconds to scan one file, versus around 2 minutes to scan one file using Aho-Corasick. I cannot read the entire hash file into memory for obvious reasons.

IEnumerable<DirectoryInfo> directories = new DirectoryInfo(scanPath).EnumerateDirectories();
IEnumerable<FileInfo> files = new DirectoryInfo(scanPath).EnumerateFiles();

FileInfo hashes = new FileInfo(hashPath);
await Task.Run(() =>
{
    IEnumerable<string> lines = File.ReadLines(hashes.FullName);
    
    foreach (FileInfo file in files) {
        if (!AuthenticodeTools.IsTrusted(file.FullName))
        {
            string hash = getHash(file.FullName);
            if (lines.Contains(hash)) flaggedFiles.Add(file.FullName);
        }
        filesScanned += 1;
    }
});
foreach (DirectoryInfo directory in directories)
{
    await scan(directory.FullName, hashPath);
    directoriesScanned += 1;
}

Edit: Per request, here are examples of the file's content:

5c269c9ec0255bbd9f4e20420233b1a7
63510b1eea36a23b3520e2b39c35ef4e
0955924ebc1876f0b849b3b9e45ed49d

They are MD5 hashes.

Infiniwave
  • 537
  • 6
  • 20
  • 1
    You should measure how much time is for reading the file and how much time is for searching the string. No algorithm can be faster than your hard disk. To do it, for example simply remove the `lines.Contains` line of code. – xanatos Jan 31 '21 at 20:26
  • Have you taken a look at some of the answers [here](https://stackoverflow.com/questions/13959429/c-sharp-searching-large-text-file)? There are many performance "tricks" you can employ here, but I'm very hesitant to post high performance C# code as it's often a trade off (readability, unsafe, memory usage, etc...). And I almost always get the response "this slower method is fine, yours is overkill". My very first starting point would be a performance profiler, before anything else. – Zer0 Jan 31 '21 at 20:28
  • @xanatos My hard disk shouldn't be a problem. Without the `lines.Contains` it scans at around 100 files per second. (plus it's an SSD) – Infiniwave Jan 31 '21 at 20:30
  • 4
    I'll say that you should reverse the code... As you wrote the code, the 1GB hash file is reread FOR EACH FILE. You could first enumerate all the files, calc the hash of each file, put both pieces of information (name + hash) in a dictionary, and THEN compare it with the hash list – xanatos Jan 31 '21 at 20:32
  • Could you include in your question the first few lines of the `hashes` file? – Theodor Zoulias Jan 31 '21 at 20:34
  • 1
    Or you could really load the hash file in memory... 1gb on disk is less than 500mb in memory if done well (because on disk the hashes are in hex format, while in memory you would save them in binary format) – xanatos Jan 31 '21 at 20:37
  • @xanatos alright, I'll try that – Infiniwave Jan 31 '21 at 20:37
  • If you turn `lines` into a `Set`, you would get constant lookup. – sardok Jan 31 '21 at 20:37
  • Thanks. And how many hashes are stored inside the `hashes` file? – Theodor Zoulias Jan 31 '21 at 20:37
  • @TheodorZoulias I'd say around 35 million. – Infiniwave Jan 31 '21 at 20:41
  • Sort the hashes in the file and do a binary search on them. The hashes are all the same length, so you can do a straight index lookup by using multiplication. It works out max 25 lookups each file. Also, store the hashes as binary, not text, and remove the newlines, that cuts down size by over half – Charlieface Jan 31 '21 at 20:44
  • And what is the limit of the RAM you are allowed to allocate during the search? For example 200 MB are OK or too much? – Theodor Zoulias Jan 31 '21 at 20:47
  • 2
    We don't need to allocate much if we are doing a binary search. Each hash can be compressed in binary to 16 bytes. So we only need a buffer that big – Charlieface Jan 31 '21 at 20:48
  • @Charlieface I think that the intended use of the comments section is to help clarify the question, not to answer it. – Theodor Zoulias Jan 31 '21 at 20:52
  • 4
    @TheodorZoulias We are in the stage of advanced prototyping, even called "throwing s##t to the wall and seeing what sticks better" :-) – xanatos Jan 31 '21 at 20:55
  • @xanatos What do you think? – Charlieface Jan 31 '21 at 21:28
  • Read lines into bytes, which will take less memory than your strings. For the algorithm see: https://en.wikipedia.org/wiki/MD5 – jdweng Jan 31 '21 at 22:38

2 Answers

5

As the hashes are fixed at 32 hex digits (16 bytes), they should be stored as raw binary, with no separators. That way we can seek straight to any given hash with a simple multiplication.

If we then sort the hashes in the file, we can speed the search up by doing a binary search for each hash.

Ordering can be done using the CompareHashes function below as a compare function.
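
For example, a minimal (untested) sketch of that one-off conversion might look like the following, assuming the source file is plain text with one 32-character hex MD5 per line (as in the question) and that Convert.FromHexString is available (.NET 5+; on older frameworks you can parse the hex pairs yourself). The file names here are placeholders, and the whole list is held in memory while sorting, which should be fine if you only rebuild the file when the hash list changes:

// One-off conversion: hex text file -> sorted file of raw 16-byte hashes.
// "hashes.txt" and "hashes.bin" are placeholder names.
var allHashes = File.ReadLines("hashes.txt")
                    .Select(line => Convert.FromHexString(line.Trim()))
                    .ToList();

// Sort with the same ordering that the binary search below relies on.
allHashes.Sort(CompareHashes);

using (var output = File.Create("hashes.bin"))
{
    foreach (var hash in allHashes)
        output.Write(hash, 0, hash.Length);
}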


Once we have done that we can do a binary search.

Binary search is a simple algorithm that searches through a sorted list. It has O(log2 n) complexity, so, for the number of hashes you have, it would only require at most around 25 lookups. The algorithm is as follows:

  1. Start in the middle of the sorted list.
  2. If the item we're looking for is there, then good.
  3. If it comes earlier, move the high end of the search range to just before the current position, take the middle of the new (halved) range, and go back to step 2.
  4. If it comes later, move the low end of the search range to just after the current position, take the middle of the new (halved) range, and go back to step 2.
  5. If the range becomes empty, the item isn't in the file.

(I've lifted and modified some of the code from ArraySortHelper in .NET Framework for this.)

public static bool ContainsHash(FileStream hashFile, byte[] hash)
{
    const int hashSize = 16;
    var curHash = new byte[hashSize];
    long lo = 0;
    long hi = hashFile.Length / hashSize - 1;
    while (lo <= hi)
    {
        long i = lo + ((hi - lo) >> 1);
        // Seek to the i-th hash and read its 16 bytes into the buffer
        hashFile.Seek(i * hashSize, SeekOrigin.Begin);
        hashFile.Read(curHash, 0, hashSize);

        int order = CompareHashes(curHash, hash);
 
        if (order == 0) return true;
        if (order < 0)
        {
            lo = i + 1;
        }
        else
        {
            hi = i - 1;
        }
    }
    return false;
}

public static int CompareHashes(byte[] b1, byte[] b2)
{
    var comp = 0;
    for (int i = 0; i < b1.Length; i++)
    {
        comp = b1[i].CompareTo(b2[i]);
        if(comp != 0) return comp;
    }
    return comp;
}

We only need to open the file of hashes once, and pass the FileStream for the hashes to the function, along with a hash to compare.
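
For reference, a rough usage sketch in the shape of the question's loop might be (assuming getHash returns the 32-character hex MD5 as in the question's sample data, and using Convert.FromHexString from .NET 5+ to turn it into bytes):

using (var hashFile = File.OpenRead(hashPath))
{
    foreach (FileInfo file in files)
    {
        // Only hash and look up files that fail the Authenticode check,
        // mirroring the question's code.
        if (AuthenticodeTools.IsTrusted(file.FullName)) continue;

        byte[] hash = Convert.FromHexString(getHash(file.FullName));
        if (ContainsHash(hashFile, hash))
            flaggedFiles.Add(file.FullName);
    }
}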


I may have some slight errors as I have not tested it. I would love for others to please test and edit this answer.

Charlieface
  • 52,284
  • 6
  • 19
  • 43
  • If the code is correct it is a good idea. The only problem is keeping the hash file sorted. Problem: ordering a file full of hashes requires loading the whole file in memory (or doing mergesort, but it is a pain to implement) – xanatos Jan 31 '21 at 21:30
  • Hmm... if the hash file needs to be sorted I can write a simple node script to sort it. – Infiniwave Jan 31 '21 at 21:32
  • @ComputerGeek12 If the file is "fixed" (or rarely changes) then you can reorder once (or rarely) and live happy. – xanatos Jan 31 '21 at 21:33
  • For that you need the whole thing in RAM though. [Binary search is not a difficult algorithm to get the hang of](https://www.geeksforgeeks.org/binary-search/). Just make sure you take into account the size of the hash for multiplying out the `Read` – Charlieface Jan 31 '21 at 21:38
  • If you are going to use the disk/SSD for storing and searching the hashes, I think it would make sense to use a proper database (for example SQLite), and put all hashes in one table having a single indexed field. – Theodor Zoulias Feb 01 '21 at 00:24
  • 1
    Depends what else you are doing. If this is the only usecase in the application then overkill IMO. – Charlieface Feb 01 '21 at 00:27
  • It also depends on the performance requirements. Querying an indexed table of an in-process database should be faster than seeking randomly to a file, especially if the storage hardware is a classic hard disk. I haven't done any performance measurements though, and I may be wrong. – Theodor Zoulias Feb 01 '21 at 00:36
  • I think operating system buffers would hold a certain amount of recently used pages anyway, but given that Binary Search has `O(log2 n)` complexity, it only takes 25 lookups maximum for this amount of hashes. Even HDDs would make quick work of that. I suppose you could hash the hashes (!) and store that in RAM to speed it up, but wouldn't save much unless you're comparing a lot of files – Charlieface Feb 01 '21 at 00:40
  • Assuming an HDD with an [access time](https://en.wikipedia.org/wiki/Hard_disk_drive_performance_characteristics) of ~10ms, searching one hash using the binary search method would take a quarter of a second on average. Which is not that great if you have thousands of hashes to search. – Theodor Zoulias Feb 01 '21 at 00:51
  • You could probably improve performance further with MemoryMappedFile instead of FileStream, especially for repeated queries. – Simon Feb 01 '21 at 00:53
  • @TheodorZoulias Using databases to do this is a valid performance point and worth consideration, but outside the scope. Same reason I didn't bring up memory mapped files. Most questions presume the data fits in memory, and this does. It's small. – Zer0 Feb 01 '21 at 00:55
  • 2
    @Zer0 I could claim that Charlieface's solution is nothing more than a tiny home-made database, consisting of a single table with one field. If this answer is in the scope of the OP's question, then so are memory mapped files and [Disk Based Data Structures](https://archive.codeplex.com/?p=mmf) and whatnot. – Theodor Zoulias Feb 01 '21 at 01:02
  • 1
    @Zer0 I was actually thinking of MM files, but doesn't really help as we'd constantly need to move the window, we're trying to conserve RAM so can't hold the whole thing. Bear in mind we have no data on the ratio of successful to unsuccessful matches, so that will change things. Plus, we have not taken into account disk buffering, usually in the 10s MBs, and OS buffering also. **Plus, more importantly, the last few lookups will be very close together,** from about lookup 17 we are at less than 4KB so within a typical block size. – Charlieface Feb 01 '21 at 01:04
  • @TheodorZoulias Go ahead and post that as a separate answer, it's my personal opinion that was all :-) – Charlieface Feb 01 '21 at 01:05
  • I can't answer this question as long as critical information is missing. I've asked the OP for the maximum amount of RAM that an acceptable solution could allocate, and up to now the OP has not replied. Your solution allocates practically zero RAM, which is nice and stylish, but also probably over-conservative. Or, who knows?, may be exactly what the OP wants. – Theodor Zoulias Feb 01 '21 at 08:03
  • @ComputerGeek12 Better to change it all to `long`. If this worked for you, you can accept it by clicking the check mark. – Charlieface Feb 07 '21 at 01:27
0

It seems like you will process all of the files in the directory, so why not change your approach? First, fill a dictionary of all of the files that aren't trusted with something like:

var hashDict = files.Where(fi => !IsTrusted(fi.FullName))
                    .ToDictionary(fi=>fi.FullName,fi=>getHash(fi.FullName));

Now that you have a list of the hashes to check, pass them into a method that gets the flagged files.

using (var stream = File.OpenRead(hashPath))
{
    var flaggedFiles = GetFilesWithMatchingHashes(stream, hashDict);
    // Do whatever you need to do with the list.
}

Here is the search method:

private static List<string> GetFilesWithMatchingHashes(Stream s, Dictionary<string,string> hashes)
{
    var results = new List<string>();
    var bufsize = (1024 * 1024 / 34)*34; // Each line should be 32 characters for the hash and 2 for cr-lf
                                         // Adjust if this isn't the case
    var buffer = new byte[bufsize];
    s.Seek(0, SeekOrigin.Begin);

    var readcount = bufsize;
    var keyList = hashes.Keys.ToList();
    while (keyList.Count > 0 && (readcount = s.Read(buffer, 0, bufsize)) > 0)
    {
        var str = Encoding.ASCII.GetString(buffer, 0, readcount);
        for (var i = keyList.Count - 1; i >= 0; i--)
        {
            var k = keyList[i];
            if (str.Contains(hashes[k]))
            {
                results.Add(k);
                keyList.RemoveAt(i);
            }
        }
    }
    return results; // This should contain a list of the files with found hashes.
}

The benefit of this solution is that you will only scan through the file once. I did some testing where I searched for the last hash in a file of 1,020,000,000 bytes. Just searching for one hash was more than twice as fast as your ReadLines method. Getting them all at once should be much faster.

Jim Berg
  • 609
  • 4
  • 7