
After reading many blogs and articles, I have arrived at the following code for searching for a string in all files inside a folder. It works fine in my tests.

QUESTIONS

  1. Is there a faster approach for this (using C#)?
  2. Is there any scenario that will fail with this code?

Note: I tested only with very small files, and with very few of them.

CODE

static void Main()
{
    string sourceFolder = @"C:\Test";
    string searchWord = ".class1";

    List<string> allFiles = new List<string>();
    AddFileNamesToList(sourceFolder, allFiles);

    foreach (string fileName in allFiles)
    {
        string contents = File.ReadAllText(fileName);
        if (contents.Contains(searchWord))
        {
            Console.WriteLine(fileName);
        }
    }

    Console.WriteLine(" ");
    Console.ReadKey();
}

public static void AddFileNamesToList(string sourceDir, List<string> allFiles)
{
    string[] fileEntries = Directory.GetFiles(sourceDir);
    foreach (string fileName in fileEntries)
    {
        allFiles.Add(fileName);
    }

    // Recursion
    string[] subdirectoryEntries = Directory.GetDirectories(sourceDir);
    foreach (string item in subdirectoryEntries)
    {
        // Avoid "reparse points"
        if ((File.GetAttributes(item) & FileAttributes.ReparsePoint) != FileAttributes.ReparsePoint)
        {
            AddFileNamesToList(item, allFiles);
        }
    }
}

REFERENCES

  1. Using StreamReader to check if a file contains a string
  2. Splitting a String with two criteria
  3. C# detect folder junctions in a path
  4. Detect Symbolic Links, Junction Points, Mount Points and Hard Links
  5. FolderBrowserDialog SelectedPath with reparse points
  6. C# - High Quality Byte Array Conversion of Images
LCJ
  • This might be more suited to the Code Review site: http://codereview.stackexchange.com – Simon Martin Dec 21 '12 at 16:18
  • Did you test with a file larger than your available RAM and swap? – Dark Falcon Dec 21 '12 at 16:19
  • Well, at least it's slow (or it won't work) with very large files. Moreover, if you have a very large number of files it'll hang too (because you create the whole list before you start searching). – Adriano Repetti Dec 21 '12 at 16:21
  • Use EnumerateFiles (http://msdn.microsoft.com/en-us/library/system.io.directory.enumeratefiles.aspx) to scan the directory step by step, and (if you have to handle very large text files) a better search algorithm too (http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm, for example). – Adriano Repetti Dec 21 '12 at 16:23

5 Answers


Instead of File.ReadAllText(), better to use

File.ReadLines(@"C:\file.txt");

It returns an IEnumerable<string> (lazily yielded), so you will not have to read the whole file if your string is found before the last line of the text file is reached.
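
As an illustration, the check from the question could be wrapped like this (a minimal sketch; the helper name is mine, and it assumes the search string never spans a line break):

using System.IO;
using System.Linq;

static bool FileContains(string fileName, string searchWord)
{
    // Streams the file line by line instead of loading it all into memory;
    // Any() stops enumerating (and thus reading) at the first matching line.
    return File.ReadLines(fileName).Any(line => line.Contains(searchWord));
}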

VladL

I wrote something very similar; here are a couple of changes I would recommend.

  1. Use Directory.EnumerateDirectories instead of GetDirectories; it returns immediately with an IEnumerable<string>, so you don't need to wait for all of the directories to be read before processing starts.
  2. Use ReadLines instead of ReadAllText; it loads only one line into memory at a time, which is a big deal if you hit a large file.
  3. If you are using a new enough version of .NET, use Parallel.ForEach; it lets you search multiple files at once (see the sketch after this list).
  4. You may not be able to open every file: check for read permissions, or add to the manifest that your program requires administrative privileges (you should still check, though).
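
Put together, the first three suggestions could look roughly like this (a sketch under those assumptions, not the code from my tool; folder and search word are placeholders):

using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class TextSearch
{
    static void Main()
    {
        string folder = @"C:\Test";      // placeholder root folder
        string searchWord = ".class1";   // placeholder search string

        // EnumerateFiles yields paths as the tree is walked, and
        // Parallel.ForEach searches several files concurrently.
        Parallel.ForEach(
            Directory.EnumerateFiles(folder, "*", SearchOption.AllDirectories),
            fileName =>
            {
                try
                {
                    // ReadLines streams one line at a time (point 2).
                    if (File.ReadLines(fileName).Any(line => line.Contains(searchWord)))
                        Console.WriteLine(fileName);
                }
                catch (UnauthorizedAccessException)
                {
                    // Point 4: files we cannot read are skipped, not fatal.
                }
                catch (IOException)
                {
                    // Locked or otherwise unreadable files are skipped too.
                }
            });
    }
}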

I was creating a binary search tool; here are some snippets of what I wrote to give you a hand.

private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    Parallel.ForEach(Directory.EnumerateFiles(_folder, _filter, SearchOption.AllDirectories), Search);
}

//_array contains the binary pattern I am searching for.
private void Search(string filePath)
{
    if (Contains(filePath, _array))
    {
        //filePath points at a match.
    }
}

private static bool Contains(string path, byte[] search)
{
    // I am doing ReadAllBytes because this is a binary search, not a text search:
    // there are no "lines" to separate on.
    var file = File.ReadAllBytes(path);
    // The upper bound is exclusive, so +1 lets a match at the very end of the file be found.
    var result = Parallel.For(0, file.Length - search.Length + 1, (i, loopState) =>
        {
            if (file[i] == search[0])
            {
                byte[] localCache = new byte[search.Length];
                Array.Copy(file, i, localCache, 0, search.Length);
                if (Enumerable.SequenceEqual(localCache, search))
                    loopState.Stop();
            }
        });
    return result.IsCompleted == false;
}

This uses two nested parallel loops. The design is terribly inefficient and could be greatly improved by using the Boyer-Moore search algorithm, but I could not find a binary implementation and did not have the time to implement one myself when I originally wrote this.
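
For illustration, a byte-array version of the simpler Boyer-Moore-Horspool variant might look like this (a sketch, not tested against the tool above):

using System;

static class ByteSearch
{
    // Returns the index of the first occurrence of needle in haystack, or -1.
    public static int IndexOf(byte[] haystack, byte[] needle)
    {
        if (needle.Length == 0) return 0;
        if (haystack.Length < needle.Length) return -1;

        // For each byte value: how far the pattern may slide when that byte
        // sits under the pattern's last position and the comparison fails.
        var shift = new int[256];
        for (int i = 0; i < shift.Length; i++)
            shift[i] = needle.Length;
        for (int i = 0; i < needle.Length - 1; i++)
            shift[needle[i]] = needle.Length - 1 - i;

        int pos = 0;
        while (pos <= haystack.Length - needle.Length)
        {
            // Compare right to left, as in full Boyer-Moore.
            int j = needle.Length - 1;
            while (j >= 0 && haystack[pos + j] == needle[j])
                j--;
            if (j < 0)
                return pos;  // every byte matched

            pos += shift[haystack[pos + needle.Length - 1]];
        }
        return -1;
    }
}

The shift table lets the pattern jump ahead by up to its own length on a mismatch, which is where the win over a naive byte-by-byte scan comes from.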

Scott Chamberlain

The main problem here is that you are searching all the files in real time for every search. There is also the possibility of file-access conflicts if two or more users are searching at the same time.

To dramatically improve performance, I would index the files ahead of time, and re-index them as they are edited/saved. Store the index using something like Lucene.NET, then query the index (again using Lucene.NET) and return the file names to the user, so the user never queries the files directly (a sketch of this follows the list below).

If you follow the links in this SO post you may get a head start on implementing the indexing. I didn't follow the links, but it's worth a look.

Just a heads up: this will be an intense shift from your current approach, and it will require

  1. a service to monitor/index the files, and
  2. the UI project.
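
A rough sketch of the indexing and querying side, assuming Lucene.Net 3.x (the index location, field names, and source folder are placeholders, and the monitoring service is left out):

using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class IndexedSearch
{
    static void Main()
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // Indexing pass: store each file's path, analyze its contents.
        using (var indexDir = FSDirectory.Open(@"C:\SearchIndex"))
        using (var writer = new IndexWriter(indexDir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (var path in System.IO.Directory.EnumerateFiles(@"C:\Test", "*", SearchOption.AllDirectories))
            {
                var doc = new Document();
                doc.Add(new Field("path", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.Add(new Field("contents", File.ReadAllText(path), Field.Store.NO, Field.Index.ANALYZED));
                writer.AddDocument(doc);
            }
        }

        // Query pass: the user's search hits the index, never the files.
        using (var indexDir = FSDirectory.Open(@"C:\SearchIndex"))
        using (var searcher = new IndexSearcher(indexDir, readOnly: true))
        {
            var query = new QueryParser(Version.LUCENE_30, "contents", analyzer).Parse("class1");
            foreach (var hit in searcher.Search(query, 100).ScoreDocs)
                Console.WriteLine(searcher.Doc(hit.Doc).Get("path"));
        }
    }
}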
Jason Meckley
  • Most of the time a real-time file search is exactly what I want. As a programmer, most things I'm looking for are in small text files with known extensions. I don't want these large indexes hogging RAM and chewing on my disk while I'm working. My disk is fast for a reason. – Brannon Dec 22 '12 at 05:23

I think your code will fail with an exception if you lack permission to open a file.

Compare it with the code here: http://bgrep.codeplex.com/releases/view/36186

The latter code supports

  1. regular expression search and
  2. filters for file extensions

-- things you should probably consider. (A rough sketch of both follows.)
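
As an illustration of those two points (plus the permission issue above), a search loop might look something like this -- a sketch with placeholder folder, extensions, and pattern, not bgrep's actual code:

using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

class RegexSearch
{
    static void Main()
    {
        string folder = @"C:\Test";                          // placeholder
        var extensions = new[] { ".cs", ".cshtml", ".css" }; // placeholder filter
        var pattern = new Regex(@"\.class\d+", RegexOptions.Compiled);

        var files = Directory.EnumerateFiles(folder, "*", SearchOption.AllDirectories)
                             .Where(f => extensions.Contains(Path.GetExtension(f), StringComparer.OrdinalIgnoreCase));
        foreach (var file in files)
        {
            try
            {
                if (File.ReadLines(file).Any(line => pattern.IsMatch(line)))
                    Console.WriteLine(file);
            }
            catch (UnauthorizedAccessException)
            {
                // No read permission: report and move on instead of crashing.
                Console.Error.WriteLine("Skipped (access denied): " + file);
            }
        }
    }
}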

Brannon
  • Typically, when searching in files, you want to be able to throw in a few wildcards in the search string. The above code is looking for a hard-coded ".class1". That's fine if that was the intent; however, that's the kind of thing you would want to parameterize -- how can a person reuse that fancy code with hard-coded inputs? Regular expressions take that a step further: they increase the power of the tool. There are plenty of guides and help for regular expressions on the web. Look it up if you're not familiar with the concept. – Brannon Dec 22 '12 at 06:16
  1. Instead of Contains, better to use the Boyer-Moore search algorithm.

  2. Fail scenario: a file for which you do not have read permission.

Serj-Tm
  • Using IndexOf with [StringComparison.Ordinal or OrdinalIgnoreCase](http://msdn.microsoft.com/en-us/library/system.stringcomparison.aspx) is [more performant](http://stackoverflow.com/questions/4904705/boyer-moore-practical-in-c) than Boyer-Moore, and `Contains` calls [IndexOf(value, StringComparison.Ordinal)](http://stackoverflow.com/a/498722/80274). – Scott Chamberlain Dec 21 '12 at 16:40
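
For instance, the ordinal check the comment recommends looks like this (illustrative only, reusing the question's variables):

// Ordinal comparison skips culture-aware rules, which is what makes it fast.
bool found = contents.IndexOf(searchWord, StringComparison.Ordinal) >= 0;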