I want to find a string in a very large text file (tens of gigabytes of text). I need to use buffers and multithreading, and this is what I plan to do:
- Loop over the file, reading one chunk of text into a buffer on each iteration.
- Split the chunk among the given number of threads.
- Each thread searches for the string in the part of the text it received.
- If the string is found, print its location; otherwise read the next chunk of the text file.
This is what I tried to do:
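string[] lines = System.IO.File.ReadAllLines(@textfile);
int counter = lines.Length / nThreads;
(int, int)[] results = new (int, int)[nThreads];
// Debug output: at this point every entry is still the default (0, 0)
results.ToList().ForEach(r => Console.WriteLine(r.ToString()));
for (int i = 0; i < nThreads; i++)
{
    // Per-iteration locals, so each lambda captures its own values
    // instead of sharing the loop variable and one start/end pair
    int index = i;
    int start = i * counter;
    int end = (i == nThreads - 1) ? lines.Length : start + counter;
    // ThreadSearch.SubarraySearch - function that searches for the string
    // in the array between start and end
    Thread thread1 = new Thread(() => { results[index] = ThreadSearch
        .SubarraySearch(StringToSearch, Delta, lines, start, end); });
    thread1.Start();
    thread1.Join(); // waits for this thread before the next one is created
}
// At the end I will go through the results array to see if any
// of the threads found something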
I have two problems implementing this algorithm:
- I don't know how to run all the threads at the same time and stop them when a result is found.
- I only know how to read the file line by line using a buffer, not how to read a fixed-size chunk of the text.
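To make the first problem concrete, here is a minimal sketch of what "run all threads at once and stop on the first hit" might look like, assuming the text to search is already in memory as a string array. The class and method names (ParallelLineSearch, FindLine) are hypothetical and only for illustration; it uses a plain Contains check instead of my ThreadSearch.SubarraySearch. The idea is to start every thread first, join them only afterwards, and share a CancellationTokenSource so the other threads stop early once one of them finds a match.

```csharp
using System;
using System.Threading;

static class ParallelLineSearch
{
    // Hypothetical helper: returns the index of a line containing needle,
    // or -1 if no line contains it. Not necessarily the first such line.
    public static int FindLine(string[] lines, string needle, int nThreads)
    {
        int found = -1;
        using (var cts = new CancellationTokenSource())
        {
            var threads = new Thread[nThreads];
            // Ceiling division so that every line belongs to some range
            int perThread = (lines.Length + nThreads - 1) / nThreads;

            for (int t = 0; t < nThreads; t++)
            {
                // Locals declared inside the loop: each lambda gets its own copy
                int start = Math.Min(t * perThread, lines.Length);
                int end = Math.Min(start + perThread, lines.Length);

                threads[t] = new Thread(() =>
                {
                    for (int i = start; i < end && !cts.IsCancellationRequested; i++)
                    {
                        if (lines[i].Contains(needle))
                        {
                            // Record the hit only if no other thread did so first
                            Interlocked.CompareExchange(ref found, i, -1);
                            cts.Cancel(); // ask the other threads to stop
                            return;
                        }
                    }
                });
                threads[t].Start(); // start now, do not join yet
            }

            // Join only after all threads have been started
            foreach (Thread thread in threads)
            {
                thread.Join();
            }
        }
        return found;
    }
}
```

Calling ParallelLineSearch.FindLine(lines, "word", nThreads) would then return an index into lines, or -1. In the chunked version the array would be the current chunk rather than the whole file.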
Constraints and invariants of the input data:
- I don't know the exact size of the text file, only that it's too large to read at once, but I do know that the maximum buffer size per thread is 10k.
- The length of the string I'm searching for is unknown. It is probably a word from the text file, and it will contain only letters and digits, no spaces or newlines.
I'm new to C#, so any help would be appreciated.
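Given these constraints, a rough single-threaded sketch of reading fixed-size chunks (instead of lines) could look like the following. The names ChunkedSearch and FindFirst are hypothetical, and I am assuming the 10k limit means 10,000 characters. Because the search string can straddle two chunks, the sketch carries the last needle.Length - 1 characters of each chunk over into the next one. The returned position is a character offset, not a byte offset.

```csharp
using System;
using System.IO;

static class ChunkedSearch
{
    // Hypothetical helper: returns the zero-based character offset of the
    // first occurrence of needle in the reader's text, or -1 if absent.
    public static long FindFirst(TextReader reader, string needle, int bufferSize = 10000)
    {
        if (string.IsNullOrEmpty(needle))
            throw new ArgumentException("needle must be non-empty");
        int overlap = needle.Length - 1;
        if (bufferSize <= overlap)
            throw new ArgumentException("buffer must be longer than the overlap");

        char[] buffer = new char[bufferSize];
        int carried = 0;       // characters kept from the previous chunk
        long bufferStart = 0;  // offset in the whole text of buffer[0]

        while (true)
        {
            // Fill the rest of the buffer after the carried-over characters
            int read = reader.Read(buffer, carried, bufferSize - carried);
            if (read == 0) return -1; // end of text, nothing new to search
            int filled = carried + read;

            // Ordinal search in the filled part of the buffer
            int hit = new string(buffer, 0, filled)
                .IndexOf(needle, StringComparison.Ordinal);
            if (hit >= 0) return bufferStart + hit;

            // Keep the tail so a match spanning two chunks is not missed
            int keep = Math.Min(overlap, filled);
            Array.Copy(buffer, filled - keep, buffer, 0, keep);
            bufferStart += filled - keep;
            carried = keep;
        }
    }
}
```

For a file this would be called with a StreamReader, for example: using (var reader = new StreamReader(path)) { long pos = ChunkedSearch.FindFirst(reader, "word"); }. Splitting each filled buffer among threads would need the same overlap between neighbouring slices.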