
I want to find a string in a very large text file (tens of gigabytes of text). I need to use buffers and multithreading, and this is what I thought of doing:

  1. Create a loop that reads a chunk of the text into a buffer at each iteration.
  2. Split the chunk among the given number of threads.
  3. Each thread searches for the string in the part of the chunk it received.
  4. If the string was found, print its location; otherwise read another chunk of the text file.

This is what I tried to do:

string[] lines = System.IO.File.ReadAllLines(@textfile);
int counter = lines.Length / nThreads;
int start = 0;
int end = counter;
(int,int)[] results = new (int, int)[nThreads];

results.ToList().ForEach(i => Console.WriteLine(i.ToString()));
for (int i = 0; i < nThreads; i++)
{
    Thread thread1 = new Thread(() => { results[i] = ThreadSearch
        .SubarraySearch(StringToSearch, Delta, lines, start, end); });
    // ThreadSearch - a function that searches for the string in the array
    thread1.Start();
    thread1.Join();
}
// At the end I will go through the results array to see if any
// of the threads found something

Now I have two problems implementing this algorithm:

  1. I don't know how to run all the threads at the same time and stop when a result is found.
  2. I only know how to read line by line with a buffer, not a chunk of the text.

Constraints and invariants of the input data:

  1. I don't know the exact size of the text file, only that it's too large to read at once; I do know that the maximum buffer size per thread is 10k.
  2. The size of the string I'm searching for is unknown - probably a word from the text file. It will contain only letters and digits, no spaces or new lines.

I'm new to C#, so any help would be great.

– Dana
  • What if the searched text happens to be split in half at the boundary between two consecutive buffers? – Theodor Zoulias Apr 17 '21 at 14:24
  • The instruction was to use a buffer to read the file (because it's too large to be read at once) and use several threads in parallel to find the string – Dana Apr 17 '21 at 14:28
  • What is "very large"? Write an example of a possible size. – i486 Apr 17 '21 at 14:30
  • The maximum buffer size per thread is 10k – Dana Apr 17 '21 at 14:35
  • Is there any guarantee about the size or the content of the searched string? For example is it possible for the searched string to contain the new-line (CR) or line-feed (LF) characters? Or, another example, is it possible to be 10 billion chars long? – Theodor Zoulias Apr 17 '21 at 14:35
  • @Dana You want to add the new details to the post, not in comments; new readers won't necessarily search the comments. – Soleil Apr 17 '21 at 14:36
  • Nothing was said about the string size; in the example given it was one word from the text file, so I'm assuming it's not 10 GB long – Dana Apr 17 '21 at 14:40
  • @Soleil I added the new details to the post, thank you – Dana Apr 17 '21 at 14:45
  • Ok, my 10-billion-chars-long string example was unrealistic anyway, since the [maximum length of a string](https://stackoverflow.com/questions/140468/what-is-the-maximum-possible-length-of-a-net-string) is much shorter than that. But in order to provide a valid answer we must know what the constraints are. For example a solution that uses the [`File.ReadLines`](https://learn.microsoft.com/en-us/dotnet/api/system.io.file.readlines) method will fail to detect a string that contains the CR or LF characters. – Theodor Zoulias Apr 17 '21 at 14:49
  • Welcome to Stack Overflow. This is not a free homework service. I suggest you start by writing out the requirements of the exercise in [pseudo-code](https://en.wikipedia.org/wiki/Pseudocode). Then translate however many of the pieces as you are able into C#. Then make an effort to translate the remaining pieces into C#. Then ask for help on the specific pieces you get stuck on. – Nicholas Hunter Apr 17 '21 at 14:55
  • @TheodorZoulias OK, I checked and it will not contain CR or LF or spaces - it contains only letters & digits – Dana Apr 17 '21 at 14:57
  • @NicholasHunter Hey, this is not homework. I'm trying to study C# by myself using challenges I found on the internet, but I'm still struggling with the syntax. I provided my algorithm idea and am still trying to implement it; I was hoping to get some help here – Dana Apr 17 '21 at 15:02
  • @ThomasWeller I'm asking the author. I don't see 146GB in the original message. – i486 Apr 17 '21 at 15:10
  • @Dana - Since you're reading a file from disk, it is very unlikely that multithreading this is going to be faster than using a single thread. File I/O is extremely slow, so your threads would be idle for a large amount of time. That said, you still need to use buffers, but it's a waste of time to use more than one thread. – Enigmativity Apr 19 '21 at 02:35
  • @Enigmativity searching for a string inside another string is generally fast, but it can get quite slow in some special cases with repetitive characters/patterns in the two strings. For example this expression: `new String('a', 1_000_000).Contains(new String('a', 10_000) + 'x');` takes around 3 seconds in my PC (before returning `false`). – Theodor Zoulias Apr 19 '21 at 11:16
  • @TheodorZoulias - That's a bit dumb for the framework to take that long to find a longer string inside a shorter one. Nonetheless, there are exceedingly fast string searches available. It's a solved problem. – Enigmativity Apr 19 '21 at 12:42
  • @Enigmativity no, it's the other way around. The 10,001-length (shorter) string is not found inside the 1,000,000-length (longer) string. The reason for the bad performance is that in order to check the 990,000 possible positions that the short string can be inside the long string, you must do 10,001 forward comparisons for each position, which all fail on the 10,001st comparison. That's a similar kind of problem to the one faced by some [backtracking](http://www.regular-expressions.info/catastrophic.html) regular expressions, that need practically infinite time to compute a result. – Theodor Zoulias Apr 19 '21 at 12:57
  • @TheodorZoulias - Oh, wow. I misread the `10_000`. Sorry, it was a bit dumb for me! LOL – Enigmativity Apr 19 '21 at 21:07

2 Answers

  1. First, benchmark the file reading and the buffer processing, to find out whether you're I/O-bound, memory-bound or CPU-bound. If you're I/O-bound, the multithreaded approach is useless. You can use the C# Stopwatch (a minimal sketch follows the pseudocode below), Intel's Advisor and VTune profilers (they work on any assembly), or the C# profilers (Alt+F2 in Visual Studio).

  2. When you split your file into n memory chunks / buffers, you also risk splitting the keyword across a chunk boundary, so you want to take that aspect into account in the multithreaded approach, and in the engineering and maintenance cost (see the chunk-reading sketch after the pseudocode).

  3. You want to open several FileStreams, each stream associated with one thread (also shown in the sketch after the pseudocode).

  4. In C# you want to use tasks, the more modern approach compared to raw threads; when you have found the keyword fully, you need a thread-safe update of the shared results (use lock); semi-pseudocode:

    private readonly object _mtx = new();
    private readonly List<long> _foundPositions = new();

    public void StartMultireader()
    {
        int ncores = 16; // set manually / from config file / from CPU detection
        var tasks = new Task[ncores];
        for (int i = 0; i < ncores; i++)
        {
            int chunk = i; // copy the loop variable so each task gets its own value
            tasks[i] = Task.Run(() => FindKeyword(chunk));
        }
        Task.WaitAll(tasks); // wait until every chunk has been searched
    }

    private void FindKeyword(int fileChunk)
    {
        OpenFileStream(fileChunk);
        SearchAndUpdateResult();
    }

    private void SearchAndUpdateResult()
    {
        long position = 0;
        //search;
        {
            //if found
            {
                lock (_mtx) // thread-safe update of the shared result list
                {
                    _foundPositions.Add(position);
                }
            }
        }
    }

    private void OpenFileStream(int fileChunk)
    {
        // open a FileStream positioned at the start of this chunk
    }
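
To expand on point 1, here is a minimal sketch of the benchmark idea, timing a raw sequential read with System.Diagnostics.Stopwatch; the file path is a placeholder, and the buffer matches the 10k limit from the question:

    using System;
    using System.Diagnostics;
    using System.IO;

    class ReadBenchmark
    {
        static void Main()
        {
            string path = @"C:\YourFile.txt"; // placeholder path
            byte[] buffer = new byte[10_000]; // the 10k buffer from the question

            var sw = Stopwatch.StartNew();
            long totalBytes = 0;
            using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
            {
                int read;
                while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                    totalBytes += read;
            }
            sw.Stop();

            // If this raw read time dominates the total search time, the
            // program is I/O-bound and extra threads won't make it faster.
            Console.WriteLine($"Read {totalBytes:N0} bytes in {sw.ElapsedMilliseconds} ms");
        }
    }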
    
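And a sketch for points 2 and 3 (an illustration, not a complete solution): each thread opens its own FileStream, seeks to the start of its chunk, and reads keywordLength - 1 extra bytes, so a keyword straddling the boundary with the next chunk is still seen whole. The names ChunkReader and ReadChunk are made up here, and a single-byte encoding (ASCII) is assumed; a multi-byte encoding like UTF-8 would need more careful boundary handling:

    using System.IO;
    using System.Text;

    static class ChunkReader
    {
        // Reads chunk number chunkIndex as text. The chunk overlaps the next
        // one by (keywordLength - 1) bytes, so no occurrence of the keyword
        // can be cut in half; an occurrence lying entirely inside the overlap
        // may be reported twice, which is harmless for an existence check.
        public static string ReadChunk(
            string path, int chunkIndex, int chunkSize, int keywordLength)
        {
            byte[] buffer = new byte[chunkSize + keywordLength - 1];
            using var stream = new FileStream(
                path, FileMode.Open, FileAccess.Read, FileShare.Read);
            stream.Seek((long)chunkIndex * chunkSize, SeekOrigin.Begin);
            int total = 0, read;
            while (total < buffer.Length
                && (read = stream.Read(buffer, total, buffer.Length - total)) > 0)
            {
                total += read;
            }
            return Encoding.ASCII.GetString(buffer, 0, total);
        }
    }

Each task from the pseudocode above could call ReadChunk with its own chunkIndex and then run a plain IndexOf on the returned string.
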
– Soleil

Here is a PLINQ approach. By using the PLINQ library you can avoid explicit thread management. The library manages the threads for you, using ThreadPool threads. The AsParallel operator signifies that the subsequent operators will be executed in parallel.

string filePath = @"C:\YourFile.txt";
string searchedText = "TheTextToFind";

bool fileContainsText = File
    .ReadLines(filePath)
    .AsParallel()
    .WithDegreeOfParallelism(Environment.ProcessorCount)
    .Any(line => line.Contains(searchedText, StringComparison.Ordinal));

This solution uses the File.ReadLines method, so it works under the assumption that the searchedText does not contain CR or LF characters. Otherwise the string will not be detected.

The above solution is unlikely to be faster than a normal LINQ query, because of the synchronization overhead associated with passing each individual line to a different thread. A more sophisticated approach would be to process the lines in chunks of, say, 1000 lines per chunk:

bool fileContainsText = Partitioner
    .Create(File.ReadLines(filePath).Chunk(1000),
        EnumerablePartitionerOptions.NoBuffering)
    .AsParallel()
    .WithDegreeOfParallelism(Environment.ProcessorCount)
    .Any(array =>
    {
        return Array.FindIndex(array, line =>
            line.Contains(searchedText, StringComparison.Ordinal)) >= 0;
    });

The Chunk LINQ operator was introduced in .NET 6. For older .NET platforms see the 1st revision of this answer.
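
On platforms without Chunk, a minimal stand-in could look like the sketch below; the name ChunkFallback is hypothetical (it is not a BCL method), and this is just one straightforward way to batch an enumerable:

static IEnumerable<TSource[]> ChunkFallback<TSource>(
    IEnumerable<TSource> source, int size)
{
    List<TSource> buffer = new(size);
    foreach (TSource item in source)
    {
        buffer.Add(item);
        if (buffer.Count == size)
        {
            yield return buffer.ToArray();
            buffer.Clear();
        }
    }
    if (buffer.Count > 0)
        yield return buffer.ToArray(); // emit the final partial batch
}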

The EnumerablePartitionerOptions.NoBuffering configuration optimizes the memory usage by instructing PLINQ to avoid buffering the elements of the input enumerable. This is important when working with batches, because each batch can be quite large memory-wise, and reclaiming this memory ASAP is more important than minimizing the synchronization overhead (minimizing that overhead is the purpose of buffering the query's input, which is PLINQ's default behavior, known as chunk partitioning).

– Theodor Zoulias