
I have a large text file (over 10 GB) and I need to find all occurrences of a specific string in it. I am currently using a StreamReader and looping through each line to look for the string, but this is very slow and takes hours to complete. What is a more efficient way to search for the string in the file using C#?

I've tried using StreamReader to loop through each line of the text file and using string.Contains() to look for the string. I've also tried splitting the file into smaller chunks and using multiple threads to search each chunk, but this did not improve the performance significantly.
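For reference, here is a minimal sketch of my current line-by-line approach (simplified: a line containing the string more than once is counted once):

using System;
using System.IO;

// Baseline: read the file line by line and count matching lines.
public static int CountMatchingLines(string path, string needle)
{
    int count = 0;
    using StreamReader reader = new(path);
    string? line;
    while ((line = reader.ReadLine()) != null)
    {
        if (line.Contains(needle, StringComparison.Ordinal))
        {
            count++;
        }
    }
    return count;
}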

  • Somewhat related: [How to search for a string in a large text file, using buffers and multithreading](https://stackoverflow.com/questions/67139141/how-to-search-for-a-string-in-a-large-text-file-using-buffers-and-multithreadin) – Theodor Zoulias Feb 17 '23 at 20:11
  • You could try comparing the performance of reading the file and searching for the specific string with the performance of just reading the file. If the difference is small, it means that your bottleneck is the I/O with the storage device, and you can't really do much about it (a sketch of this comparison follows these comments). – Theodor Zoulias Feb 17 '23 at 20:46
  • I am casting a reopen vote because the [linked question](https://stackoverflow.com/questions/37979457/fastest-way-to-search-for-multiple-strings-in-multiple-large-text-files) is about searching in multiple files, while this question is about searching in a single file. A strategy that is valid for solving the one problem might not be valid for solving the other. – Theodor Zoulias Feb 23 '23 at 17:38
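A minimal sketch of the comparison suggested in the comment above (the buffer size and method name are illustrative): time a plain sequential read of the file and compare it with the read-and-search time; if the two are close, the search is I/O bound.

using System.Diagnostics;
using System.IO;

// Time a plain sequential read of the file. If this is close to the
// read-and-search time, the bottleneck is the storage device, and
// adding CPU threads won't help much.
public static double MeasureReadSeconds(string path)
{
    var buffer = new byte[1024 * 1024];   // 1 MiB read buffer (arbitrary)
    var stopwatch = Stopwatch.StartNew();
    using FileStream stream = File.OpenRead(path);
    while (stream.Read(buffer, 0, buffer.Length) > 0)
    {
        // Discard the data; only the read speed matters here.
    }
    return stopwatch.Elapsed.TotalSeconds;
}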

1 Answer


Partitioning and multi-threading can noticeably improve the performance. Gigantor is a .NET library I wrote (available as source or as a NuGet package) that applies exactly this approach to regular expression searches of large files. According to my benchmarking you should expect an improvement in the neighborhood of 8x for uncompressed text with a simple regex. I'd like to learn more about your use case if you try it and don't get similar results (or even if you do).

Here is an example that benchmarks single-threaded vs multi-threaded performance using Gigantor's RegexSearcher class.

using System.Text.RegularExpressions;
using Imagibee.Gigantor;
using System.Diagnostics;

public void Benchmark(string path, string word)
{
    // Create the dependencies we will need
    Regex regex = new(word, RegexOptions.Compiled);
    AutoResetEvent progress = new(false);
    Stopwatch stopwatch = new();

    // Benchmark results with differing number of threads
    foreach (var numWorkers in new List<int>() { 1, 128 }) {
        Console.WriteLine($"Starting search with {numWorkers} thread(s)");
        stopwatch.Start();

        // Create the searcher
        RegexSearcher searcher = new(
            path,
            regex,
            progress,
            maxMatchCount: 1000,
            maxWorkers: numWorkers);

        // Start and wait for completion
        Background.StartAndWait(
            new List<IBackground>() { searcher },
            progress,
            (_) => { Console.Write("."); },
            1000);
        Console.Write('\n');

        // Display results
        var runTime = stopwatch.Elapsed.TotalSeconds;
        Console.WriteLine($"Completed in {runTime} seconds");
        stopwatch.Reset();
    }
}
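For example, it could be invoked like this (the path and the search word are placeholders):

// Example invocation with illustrative arguments
Benchmark("/tmp/enwik9", "food");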

About the Benchmarking

The benchmarking consists of searching for the regex pattern @"food" over enwik9. Five iterations are run and the average throughput is used. The benchmarking source code is here. The command and results are shown below.

$ dotnet SearchApp/bin/Release/net6.0/SearchApp.dll benchmark ${TMPDIR}/enwik9
........................
maxWorkers=1, chunkKiBytes=512, maxThread=32767
   105160 matches found
   searched 5000000000 bytes in 24.0289207 seconds
-> 208.0825877460239 MBytes/s
..............
maxWorkers=2, chunkKiBytes=512, maxThread=32767
   105160 matches found
   searched 5000000000 bytes in 12.692795 seconds
-> 393.92426963485974 MBytes/s
.........
maxWorkers=4, chunkKiBytes=512, maxThread=32767
   105160 matches found
   searched 5000000000 bytes in 6.8668367 seconds
-> 728.1373095707955 MBytes/s
....
maxWorkers=8, chunkKiBytes=512, maxThread=32767
   105160 matches found
   searched 5000000000 bytes in 3.7174496 seconds
-> 1345.0081475213544 MBytes/s
....
maxWorkers=16, chunkKiBytes=512, maxThread=32767
   105160 matches found
   searched 5000000000 bytes in 3.0211296 seconds
-> 1655.0100995336313 MBytes/s
....
maxWorkers=32, chunkKiBytes=512, maxThread=32767
   105160 matches found
   searched 5000000000 bytes in 3.191699 seconds
-> 1566.5637643148682 MBytes/s
....
maxWorkers=64, chunkKiBytes=512, maxThread=32767
   105160 matches found
   searched 5000000000 bytes in 3.2240221 seconds
-> 1550.8578554718963 MBytes/s
....
maxWorkers=128, chunkKiBytes=512, maxThread=32767
   105160 matches found
   searched 5000000000 bytes in 3.3693127 seconds
-> 1483.982178323787 MBytes/s

The value of 8x was computed by dividing the throughput at maxWorkers=16 by the throughput at maxWorkers=1 (1655 / 208 = 7.96); this is the figure referred to above.

NOTES:

  1. The above results are with net6 targeted. I have recently re-run with net7, and the single-threaded performance has improved a lot, to 416 MBps. So with net7 the overall performance has improved to 1771 MBps, but the relative improvement has decreased to about 4x.
  2. The pattern @"food" is a very simple regex. For more complex regexes that require more CPU, I would expect a greater relative performance improvement from this approach; a URL-like pattern is sketched below.
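Something URL-like (illustrative only; this is not the published benchmark pattern) makes the engine do far more work per byte than a literal word:

using System.Text.RegularExpressions;

// A heavier, URL-like pattern (illustrative; not the published benchmark).
// Patterns like this shift more of the cost from I/O to CPU, which is
// where partitioning the search across threads helps the most.
Regex urlRegex = new(@"https?://[\w.-]+(?:/[\w./%-]*)?", RegexOptions.Compiled);

// Matches both URLs in this sample text
int count = urlRegex.Matches("see https://example.com/a and http://test.org").Count;  // count == 2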
  • An 8x improvement sounds unrealistic for a primarily I/O-bound operation on a storage device. What kind of device did you use to get these measurements? Have you tested with a classic hard disk drive? – Theodor Zoulias Feb 23 '23 at 17:33
  • @TheodorZoulias, thanks for the comment. The regular expression searches I tested were not I/O bound, and the hardware is a 2019 MacBook Pro. If you have any suggestions please feel free to email me or post an issue at https://github.com/imagibee/Gigantor/issues. – dynamicbutter Feb 23 '23 at 17:43
  • You mentioned in the answer that the regex is simple, so I assume that it is cheap to compute. Do you mean that it is simple to read, but hard to compute? Could you include in the answer the specific regex that you used for the performance measurements? – Theodor Zoulias Feb 23 '23 at 17:58
  • I meant simple in the sense that there is not a lot of backtracking, so the CPU requirement for the search is lower than for a more complex regex. I would expect a bigger improvement for a more complex regex because it uses more CPU. – dynamicbutter Feb 23 '23 at 18:27
  • @TheodorZoulias I want this to be a helpful answer. Let me know if there are any questions I haven't answered or other concerns. – dynamicbutter Feb 25 '23 at 19:33
  • I am quite puzzled how it is possible to improve 8x the performance of searching for the text `"food"` in a file stored on an SSD or hard disk drive. My assumption is that the CPU-bound work would be totally dwarfed by the I/O-bound work, which is hardly parallelizable. I can't verify your claims right now, so I'll leave the judgement to people who will actually attempt to use your library and experiment with it. – Theodor Zoulias Feb 25 '23 at 20:52
  • Thanks for your time, Gigantor is a work in progress. I can offer two reasons my search is not I/O bound (as indicated by testing): 1) it uses the Regex class, which takes Unicode strings as input, so raw byte arrays are converted to Unicode, which takes CPU and memory; 2) it finds over 100,000 matches, and each match takes CPU and memory. I'm sure an optimized word search for "food" would be much faster than regex, but my goal is to benchmark against regex searches, not more optimal word searches. Perhaps I should create a benchmark test that is more regex-y than @"food", like searching for all URLs? – dynamicbutter Feb 25 '23 at 22:38
  • Searching for URLs surely sounds more relatable than searching for a specific word. – Theodor Zoulias Feb 25 '23 at 23:03