
I'm trying to write a search program in C# that searches for a string in a large text file (5 GB). I've written the simple code shown below, but the search is time-consuming and can take about 30 minutes to complete. This is what my code looks like:

public List<string> Search(string searchKey)
{
    List<string> results = new List<string>();
    using (StreamReader fileReader = new StreamReader(@"D:\Logs.txt"))
    {
        string line;
        while ((line = fileReader.ReadLine()) != null)
        {
            if (line.Contains(searchKey))
            {
                results.Add(line);
            }
        }
    }
    return results;
}

Although the code works, it runs very slowly. Can we do something to bring the search time under a minute?

Palle Due
Dhawal Dhingra
  • Does this answer your question? [Fastest way to search string in large text file](https://stackoverflow.com/questions/37979457/fastest-way-to-search-string-in-large-text-file) – styx Mar 16 '20 at 13:48
  • 2
    Are you only searching the one file for all matches for one string ? If so, it seems unlikely you can speed it up much. Still, 30 mins to search one 5GB file for all matches for one string seems really excessive. Is it running over a network connection? How many lines match? – Matthew Watson Mar 16 '20 at 13:52
  • FYI, syntactically you can make this code a little cleaner: `var results = File.ReadLines(@"D:\Logs.txt").Where(line => line.Contains(searchKey)).ToList();`. – maccettura Mar 16 '20 at 14:36
  • @MatthewWatson - yes, I'm searching one file for all the matches for one string. There can be between 0 to 50 matches with about 5 on an average. And yes, I'm running it over a network connection. – Dhawal Dhingra Mar 16 '20 at 14:54
  • I suggest timing how long it takes to just do the file reading part of the loop (i.e. comment-out the `if (line.Contains(searchKey)) { results.Add(line); }` part). That will give you a lower-bound on the time taken. I'm guessing you'll discover that almost all the time is taken reading from the network. – Matthew Watson Mar 16 '20 at 15:06
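The timing check suggested in the comment above can be sketched like this (the path, class name, and buffer size are illustrative, not from the original post). If the read-only pass already accounts for most of the 30 minutes, the bottleneck is the network share, not `Contains`; a larger read buffer can reduce round-trips in that case:

```csharp
using System;
using System.Diagnostics;
using System.IO;

class ReadTiming
{
    // Times a plain read of every line, with a large buffer to cut
    // round-trips when the file lives on a network share.
    public static TimeSpan TimeReadOnly(string path)
    {
        var sw = Stopwatch.StartNew();
        using (var stream = new FileStream(
            path, FileMode.Open, FileAccess.Read, FileShare.Read,
            bufferSize: 1 << 20))               // 1 MB buffer
        using (var reader = new StreamReader(stream))
        {
            while (reader.ReadLine() != null) { }
        }
        sw.Stop();
        return sw.Elapsed;
    }

    static void Main()
    {
        // Illustrative path; substitute the real log file.
        Console.WriteLine(TimeReadOnly(@"D:\Logs.txt"));
    }
}
```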

3 Answers


For string search in a really large file, you can use the Boyer–Moore search algorithm, which is the standard benchmark in the practical string-search literature.
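A minimal sketch of the idea, using the simplified Boyer–Moore–Horspool variant (bad-character shift rule only; the class and method names are illustrative):

```csharp
using System;
using System.Collections.Generic;

// Boyer–Moore–Horspool: a simplified Boyer–Moore variant that uses
// only the bad-character shift table.
static class Horspool
{
    public static int IndexOf(string text, string pattern)
    {
        int m = pattern.Length, n = text.Length;
        if (m == 0) return 0;
        if (m > n) return -1;

        // For each pattern character (except the last), record how far
        // the search window may jump when that character in the text is
        // aligned with the pattern's final position.
        var shift = new Dictionary<char, int>();
        for (int i = 0; i < m - 1; i++)
            shift[pattern[i]] = m - 1 - i;

        int pos = 0;
        while (pos <= n - m)
        {
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;              // full match at pos
            char last = text[pos + m - 1];      // text char under pattern end
            pos += shift.TryGetValue(last, out int s) ? s : m;
        }
        return -1;                              // no match
    }
}
```

Note that, as the comment below points out, `string.Contains` in .NET is already heavily optimized, so a hand-rolled search like this rarely beats it for plain substring matching.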

  • I am pretty sure that ReadLines + Contains beat any hand crafted algo. https://github.com/dotnet/coreclr/blob/775003a4c72f0acc37eab84628fcef541533ba4e/src/utilcode/newapis.cpp#L577 – Drag and Drop Mar 16 '20 at 14:39

File indexing is implemented in the library Bsa.Search.Core.

You can implement your own version of reading from a file. FileByLinesRowReader reads the file line by line and adds documents whose ExternalId equals the line number. FileDocumentIndex has been tested on the wiki-data JSON dictionary. Supported targets:

  • .Net Core

  • .Net 472

         var selector = new IndexWordSelector();
         var morphology = new DefaultMorphology(new WordDictionary(), selector);
         var fileName = @"D:\Logs.txt";

         // you can implement your own file reader: csv, json, or other
         var index = new FileDocumentIndex(fileName, new FileByLinesRowReader(null), morphology);

         // if the index already exists, we skip file indexing
         if (!index.IsIndexed)
             index.Start();
         while (!index.IsReady)
         {
             Thread.Sleep(300);
         }

         var query = "(\"one\" | two) ~50 (\"error*\")".Parse("*");
         var found = index.Search(new SearchQueryRequest()
         {
             Field = "*",
             Query = query,
             ShowHighlight = true,
         });
         // ExternalId is the line number from the file:
         // found.ShardResult.First().FoundDocs.FirstOrDefault().Value.ExternalId
    
Stanislav

Gigantor provides a RegexSearcher that can do this. I tested your example with a 32 GB file I have lying around. It took less than 20 seconds on my MacBook Pro. Code shown below.

Gigantor boosts the performance of regular expressions and works with gigantic files. Here's what the code for your Search function would look like using Gigantor.

public List<string> Search(string path, string searchKey)
{
    // Create regex to search for the searchKey
    System.Text.RegularExpressions.Regex regex = new(searchKey);
    List<string> results = new List<string>();

    // Create Gigantor stuff
    System.Threading.AutoResetEvent progress = new(false);
    Imagibee.Gigantor.RegexSearcher searcher = new(
        path, regex, progress, maxMatchCount: 10000);

    // Start the search and wait for completion
    Imagibee.Gigantor.Background.StartAndWait(
        searcher,
        progress,
        (_) => { },
        1000);

    // Check for errors
    if (searcher.Error.Length != 0) {
        throw new Exception(searcher.Error);
    }

    // Open the searched file for reading
    using System.IO.FileStream fileStream = new(path, FileMode.Open);
    Imagibee.Gigantor.StreamReader reader = new(fileStream);

    // Capture the line of each match
    foreach (var match in searcher.GetMatchData()) {
        fileStream.Seek(match.StartFpos, SeekOrigin.Begin);
        results.Add(reader.ReadLine());
    }
    return results;
}

Here's the test code.

[Test]
public void SearchTest()
{
    var path = Path.Combine(Path.GetTempPath(), "enwik9x32");
    Stopwatch stopwatch = new();
    stopwatch.Start();
    var results = Search(path, "unicorn");
    stopwatch.Stop();
    Console.WriteLine($"found {results.Count} results in {stopwatch.Elapsed.TotalSeconds} seconds");
}

Here's the console output

found 8160 results in 19.1458573 seconds

And here's the Gigantor source repo. I know it's a little late, but hopefully this answer is helpful to someone.

dynamicbutter