
After reading many blogs and articles, I have arrived at the following code for searching for a string in all files inside a folder. It works fine in my tests.

QUESTIONS

  1. Is there a faster approach for this (using C#)?
  2. Is there any scenario that will fail with this code?

Note: I tested only with very small files, and with very few of them.

CODE

static void Main()
{
    string sourceFolder = @"C:\Test";
    string searchWord = ".class1";

    List<string> allFiles = new List<string>();
    AddFileNamesToList(sourceFolder, allFiles);

    foreach (string fileName in allFiles)
    {
        string contents = File.ReadAllText(fileName);
        if (contents.Contains(searchWord))
        {
            Console.WriteLine(fileName);
        }
    }

    Console.WriteLine(" ");
    Console.ReadKey();
}

public static void AddFileNamesToList(string sourceDir, List<string> allFiles)
{
    string[] fileEntries = Directory.GetFiles(sourceDir);
    foreach (string fileName in fileEntries)
    {
        allFiles.Add(fileName);
    }

    // Recursion
    string[] subdirectoryEntries = Directory.GetDirectories(sourceDir);
    foreach (string item in subdirectoryEntries)
    {
        // Avoid "reparse points"
        if ((File.GetAttributes(item) & FileAttributes.ReparsePoint) != FileAttributes.ReparsePoint)
        {
            AddFileNamesToList(item, allFiles);
        }
    }
}

REFERENCES

  1. Using StreamReader to check if a file contains a string
  2. Splitting a String with two criteria
  3. C# detect folder junctions in a path
  4. Detect Symbolic Links, Junction Points, Mount Points and Hard Links
  5. FolderBrowserDialog SelectedPath with reparse points
  6. C# - High Quality Byte Array Conversion of Images
LCJ
  • This might be more suited to the Code Review site: http://codereview.stackexchange.com – Simon Martin Dec 21 '12 at 16:18
  • Did you test with a file larger than your available RAM and swap? – Dark Falcon Dec 21 '12 at 16:19
  • Well, at least it's slow (or it won't work) with very large files. Moreover, if you have a very large number of files it'll hang too (because you create the whole list before you start searching). – Adriano Repetti Dec 21 '12 at 16:21
  • Use EnumerateFiles (http://msdn.microsoft.com/en-us/library/system.io.directory.enumeratefiles.aspx) to scan the directory step by step, and (if you have to handle very large text files) a better search algorithm too (http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm, for example). – Adriano Repetti Dec 21 '12 at 16:23

5 Answers


Instead of File.ReadAllText(), better to use

File.ReadLines(@"C:\file.txt");

It returns an IEnumerable<string> (lazily yielded), so you will not have to read the whole file if your string is found before the last line of the text file is reached.
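
As an illustration, the check from the question could be wrapped like this (a minimal sketch; the helper name is mine, and it assumes the search string never spans a line break):

using System.IO;
using System.Linq;

static bool FileContains(string fileName, string searchWord)
{
    // Streams the file line by line instead of loading it all into memory;
    // Any() stops enumerating (and thus reading) at the first matching line.
    return File.ReadLines(fileName).Any(line => line.Contains(searchWord));
}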

VladL

I wrote something very similar; here are a couple of changes I would recommend.

  1. Use Directory.EnumerateDirectories instead of GetDirectories; it returns immediately with an IEnumerable<string>, so you don't need to wait for all of the directories to be read before processing starts.
  2. Use ReadLines instead of ReadAllText; it loads only one line into memory at a time, which is a big deal if you hit a large file.
  3. If you are using a new enough version of .NET, use Parallel.ForEach; it lets you search multiple files at once (see the sketch after this list).
  4. You may not be able to open every file: check for read permissions, or add to the manifest that your program requires administrative privileges (you should still check, though).
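
Put together, the first three suggestions could look roughly like this (a sketch under those assumptions, not the code from my tool; folder and search word are placeholders):

using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class TextSearch
{
    static void Main()
    {
        string folder = @"C:\Test";      // placeholder root folder
        string searchWord = ".class1";   // placeholder search string

        // EnumerateFiles yields paths as the tree is walked, and
        // Parallel.ForEach searches several files concurrently.
        Parallel.ForEach(
            Directory.EnumerateFiles(folder, "*", SearchOption.AllDirectories),
            fileName =>
            {
                try
                {
                    // ReadLines streams one line at a time (point 2).
                    if (File.ReadLines(fileName).Any(line => line.Contains(searchWord)))
                        Console.WriteLine(fileName);
                }
                catch (UnauthorizedAccessException)
                {
                    // Point 4: files we cannot read are skipped, not fatal.
                }
                catch (IOException)
                {
                    // Locked or otherwise unreadable files are skipped too.
                }
            });
    }
}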

I was creating a binary search tool; here are some snippets of what I wrote to give you a hand.

private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    Parallel.ForEach(Directory.EnumerateFiles(_folder, _filter, SearchOption.AllDirectories), Search);
}

//_array contains the binary pattern I am searching for.
private void Search(string filePath)
{
    if (Contains(filePath, _array))
    {
        //filePath points at a match.
    }
}

private static bool Contains(string path, byte[] search)
{
    // I am doing ReadAllBytes because this is a binary search, not a text search:
    // there are no "lines" to separate on.
    var file = File.ReadAllBytes(path);
    // The upper bound is exclusive, so +1 lets a match at the very end of the file be found.
    var result = Parallel.For(0, file.Length - search.Length + 1, (i, loopState) =>
        {
            if (file[i] == search[0])
            {
                byte[] localCache = new byte[search.Length];
                Array.Copy(file, i, localCache, 0, search.Length);
                if (Enumerable.SequenceEqual(localCache, search))
                    loopState.Stop();
            }
        });
    return result.IsCompleted == false;
}

This uses two nested parallel loops. The design is terribly inefficient and could be greatly improved by using the Boyer-Moore search algorithm, but I could not find a binary implementation and did not have the time to implement one myself when I originally wrote this.
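
For illustration, a byte-array version of the simpler Boyer-Moore-Horspool variant might look like this (a sketch, not tested against the tool above):

using System;

static class ByteSearch
{
    // Returns the index of the first occurrence of needle in haystack, or -1.
    public static int IndexOf(byte[] haystack, byte[] needle)
    {
        if (needle.Length == 0) return 0;
        if (haystack.Length < needle.Length) return -1;

        // For each byte value: how far the pattern may slide when that byte
        // sits under the pattern's last position and the comparison fails.
        var shift = new int[256];
        for (int i = 0; i < shift.Length; i++)
            shift[i] = needle.Length;
        for (int i = 0; i < needle.Length - 1; i++)
            shift[needle[i]] = needle.Length - 1 - i;

        int pos = 0;
        while (pos <= haystack.Length - needle.Length)
        {
            // Compare right to left, as in full Boyer-Moore.
            int j = needle.Length - 1;
            while (j >= 0 && haystack[pos + j] == needle[j])
                j--;
            if (j < 0)
                return pos;  // every byte matched

            pos += shift[haystack[pos + needle.Length - 1]];
        }
        return -1;
    }
}

The shift table lets the pattern jump ahead by up to its own length on a mismatch, which is where the win over a naive byte-by-byte scan comes from.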

Scott Chamberlain

The main problem here is that you are searching all the files in real time for every search. There is also the possibility of file-access conflicts if two or more users are searching at the same time.

To dramatically improve performance, I would index the files ahead of time, and re-index them as they are edited/saved. Store the index using something like Lucene.NET, then query the index (again using Lucene.NET) and return the file names to the user, so the user never queries the files directly (a sketch of this follows the list below).

If you follow the links in this SO post you may get a head start on implementing the indexing. I didn't follow the links, but it's worth a look.

Just a heads up: this will be an intense shift from your current approach, and it will require

  1. a service to monitor/index the files, and
  2. the UI project.
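
A rough sketch of the indexing and querying side, assuming Lucene.Net 3.x (the index location, field names, and source folder are placeholders, and the monitoring service is left out):

using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class IndexedSearch
{
    static void Main()
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // Indexing pass: store each file's path, analyze its contents.
        using (var indexDir = FSDirectory.Open(@"C:\SearchIndex"))
        using (var writer = new IndexWriter(indexDir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (var path in System.IO.Directory.EnumerateFiles(@"C:\Test", "*", SearchOption.AllDirectories))
            {
                var doc = new Document();
                doc.Add(new Field("path", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.Add(new Field("contents", File.ReadAllText(path), Field.Store.NO, Field.Index.ANALYZED));
                writer.AddDocument(doc);
            }
        }

        // Query pass: the user's search hits the index, never the files.
        using (var indexDir = FSDirectory.Open(@"C:\SearchIndex"))
        using (var searcher = new IndexSearcher(indexDir, readOnly: true))
        {
            var query = new QueryParser(Version.LUCENE_30, "contents", analyzer).Parse("class1");
            foreach (var hit in searcher.Search(query, 100).ScoreDocs)
                Console.WriteLine(searcher.Doc(hit.Doc).Get("path"));
        }
    }
}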
Jason Meckley
  • Most of the time a real-time file search is exactly what I want. As a programmer, most things I'm looking for are in small text files with known extensions. I don't want these large indexes hogging RAM and chewing on my disk while I'm working. My disk is fast for a reason. – Brannon Dec 22 '12 at 05:23

I think your code will fail with an exception if you lack permission to open a file.

Compare it with the code here: http://bgrep.codeplex.com/releases/view/36186

The latter code supports

  1. regular expression search and
  2. filters for file extensions

-- things you should probably consider. (A rough sketch of both follows.)
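
As an illustration of those two points (plus the permission issue above), a search loop might look something like this -- a sketch with placeholder folder, extensions, and pattern, not bgrep's actual code:

using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

class RegexSearch
{
    static void Main()
    {
        string folder = @"C:\Test";                          // placeholder
        var extensions = new[] { ".cs", ".cshtml", ".css" }; // placeholder filter
        var pattern = new Regex(@"\.class\d+", RegexOptions.Compiled);

        var files = Directory.EnumerateFiles(folder, "*", SearchOption.AllDirectories)
                             .Where(f => extensions.Contains(Path.GetExtension(f), StringComparer.OrdinalIgnoreCase));
        foreach (var file in files)
        {
            try
            {
                if (File.ReadLines(file).Any(line => pattern.IsMatch(line)))
                    Console.WriteLine(file);
            }
            catch (UnauthorizedAccessException)
            {
                // No read permission: report and move on instead of crashing.
                Console.Error.WriteLine("Skipped (access denied): " + file);
            }
        }
    }
}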

Brannon
  • Typically, when searching in files, you want to be able to throw in a few wildcards in the search string. The above code is looking for a hard-coded ".class1". That's fine if that was the intent; however, that's the kind of thing you would want to parameterize -- how can a person reuse that fancy code with hard-coded inputs? Regular expressions take that a step further: they increase the power of the tool. There are plenty of guides and help for regular expressions on the web. Look it up if you're not familiar with the concept. – Brannon Dec 22 '12 at 06:16
  1. Instead of Contains, better to use the Boyer-Moore search algorithm.

  2. Fail scenario: a file for which you do not have read permission.

Serj-Tm
  • Using IndexOf with [StringComparison.Ordinal or OrdinalIgnoreCase](http://msdn.microsoft.com/en-us/library/system.stringcomparison.aspx) is [more performant](http://stackoverflow.com/questions/4904705/boyer-moore-practical-in-c) than Boyer-Moore, and `Contains` calls [IndexOf(value, StringComparison.Ordinal)](http://stackoverflow.com/a/498722/80274). – Scott Chamberlain Dec 21 '12 at 16:40
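
For instance, the ordinal check the comment recommends looks like this (illustrative only, reusing the question's variables):

// Ordinal comparison skips culture-aware rules, which is what makes it fast.
bool found = contents.IndexOf(searchWord, StringComparison.Ordinal) >= 0;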