What is the best way to search for strings in multiple files?
Currently I am doing a foreach loop through each file, but have noticed it takes up to 4-5 minutes to go through all 4000+ files.
Is there some sort of parallel way to do this?
The best way to do this is the producer-consumer model. With this, you have one thread read from the hard drive and load the data into a queue, while an indeterminate number of other threads process the data.
So say your old code was this
foreach (var file in Directory.GetFiles(someSearch))
{
    string textToRead = File.ReadAllText(file);
    ProcessText(textToRead);
}
The new code would be
var collection = new BlockingCollection<string>(); // You may want to set a max size so you don't use up all your memory
Task producer = Task.Run(() =>
{
    foreach (var file in Directory.GetFiles(someSearch))
    {
        collection.Add(File.ReadAllText(file));
    }
    collection.CompleteAdding();
});
Parallel.ForEach(collection.GetConsumingEnumerable(), ProcessText); //Make sure any actions ProcessText does (like incrementing any variables in the class) is done in a thread safe manner.
This lets one thread read from the hard drive without fighting any other threads for I/O, while multiple threads process the data that was read in, all at the same time.
If you are doing this search regularly, consider indexing your files using a search engine like Solr. Once the files are indexed, a search takes milliseconds.
You can also embed a search engine in your app, for example using the Lucene library.
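To give an idea of what embedding Lucene looks like, here is a rough sketch using the Lucene.NET 4.8 API. Treat the field names ("path", "contents"), the index directory, and the query string as illustrative assumptions, not the one right way to set this up:

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;
using System;
using System.IO;

const LuceneVersion version = LuceneVersion.LUCENE_48;
var indexDir = FSDirectory.Open("search-index"); // assumed index location
var analyzer = new StandardAnalyzer(version);

// Index every file once (or re-index when files change).
using (var writer = new IndexWriter(indexDir, new IndexWriterConfig(version, analyzer)))
{
    foreach (var file in Directory.GetFiles(@"C:\data")) // assumed data folder
    {
        var doc = new Document
        {
            new StringField("path", file, Field.Store.YES),    // stored, not tokenized
            new TextField("contents", File.ReadAllText(file), Field.Store.NO)
        };
        writer.AddDocument(doc);
    }
    writer.Commit();
}

// Searching the index is then fast, regardless of file count.
using (var reader = DirectoryReader.Open(indexDir))
{
    var searcher = new IndexSearcher(reader);
    var query = new QueryParser(version, "contents", analyzer).Parse("someString");
    foreach (var hit in searcher.Search(query, 10).ScoreDocs)
        Console.WriteLine(searcher.Doc(hit.Doc).Get("path"));
}
```

The one-time indexing cost is comparable to a single full scan; every search after that only touches the index, not the 4000+ files.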
The chances are that most of the time is spent waiting for the files to be read from the disk. In that situation, multithreading isn't going to help much: rather than having one thread waiting for disk I/O, you now have several threads waiting for disk I/O.
This operation is largely going to be I/O-bound, so parallel processing won't really give you any added performance. You can try indexing the files using a third-party search library, but that's really all you can do as far as software goes. Splitting the files across multiple drives and using a different thread for each drive could help speed things up, if that's an option.
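The drive-splitting idea above can be sketched as one reader task per physical drive, so each drive streams its own files sequentially instead of competing with the others. The folder paths and the `ProcessText` method are assumptions (it's the same per-file processing method used in the question):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Assumed: the files have been split into one folder per physical drive.
var roots = new[] { @"C:\data", @"D:\data" };

Task[] tasks = roots.Select(root => Task.Run(() =>
{
    // Each task reads only from its own drive, so reads stay sequential per disk.
    foreach (var file in Directory.GetFiles(root))
    {
        ProcessText(File.ReadAllText(file)); // thread-safe per-file work, as in the question
    }
})).ToArray();

Task.WaitAll(tasks);
```

Note this only pays off when the folders really are on separate physical drives; two tasks hammering the same spinning disk will just make the heads seek back and forth.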