
What is the best way to search for strings in multiple files?

Currently I am doing a foreach loop through each file, but have noticed it takes up to 4-5 minutes to go through all 4000+ files.

Is there some sort of parallel way to do this?

nGX
    1) buy a big SSD, 2) buy a big RAID array (and only then use threads) – H H Jun 21 '13 at 19:29
  • Reading sequentially from one physical device is the best thing you can do for performance. If you open 10 files at once and try to scan them, you will suffer a penalty from excessive seeking every time an I/O operation hits the hard drive. Also, did you measure and find out WHERE the biggest wait is? – Daniel Mošmondor Jun 21 '13 at 19:37
  • If you know anything specific about the files... like the text you are searching for is within the first 100 bytes of the file, you could save yourself some time by reading only those bytes. Faster search, faster to close the file. – Gray Jun 21 '13 at 19:37
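
As a rough sketch of the partial-read idea from the last comment, assuming the search text always appears near the start of each file (the 100-byte prefix, the method name, and the UTF-8 assumption are illustrative, not from the comment):

using System.IO;
using System.Text;

// Reads only the first prefixLength bytes of a file instead of the whole thing.
static bool PrefixContains(string path, string needle, int prefixLength = 100)
{
    var buffer = new byte[prefixLength];
    using (var stream = File.OpenRead(path))
    {
        int read = stream.Read(buffer, 0, buffer.Length); // may return fewer bytes than requested
        return Encoding.UTF8.GetString(buffer, 0, read).Contains(needle);
    }
}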

4 Answers


The best way to do this is the producer-consumer model. You have one thread read from the hard drive and load the data into a queue, and then an indeterminate number of other threads process the data.

So say your old code was this

foreach (var file in Directory.GetFiles(someSearch))
{
    string textToRead = File.ReadAllText(file);
    ProcessText(textToRead);
}

The new code would be

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

var collection = new BlockingCollection<string>(); // You may want to set a max size so you don't use up all your memory

Task producer = Task.Run(() =>
{
    foreach (var file in Directory.GetFiles(someSearch))
    {
        collection.Add(File.ReadAllText(file));
    }
    collection.CompleteAdding();
});
Parallel.ForEach(collection.GetConsumingEnumerable(), ProcessText); // Make sure anything ProcessText does (like incrementing variables in the class) is done in a thread-safe manner.

This lets one thread read from the hard drive without fighting any other threads for I/O, while multiple threads process the data that was read, all at the same time.
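
Because several copies of ProcessText now run at the same time, the thread-safety note in the code above matters. As a minimal sketch, assuming ProcessText only needs to count matching files (the field name, search term, and use of Interlocked are illustrative, not part of the original answer):

private long _matchCount; // shared by every consumer thread

private void ProcessText(string text)
{
    // The search itself touches no shared state, so it needs no locking.
    if (text.Contains("needle"))
    {
        // Interlocked.Increment (System.Threading) makes the read-modify-write atomic,
        // so concurrent consumers cannot lose updates the way _matchCount++ could.
        Interlocked.Increment(ref _matchCount);
    }
}

Anything more involved than a counter, such as collecting matching file names, should go through a lock or a concurrent collection like ConcurrentBag<string>.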

Scott Chamberlain
  • I haven't used BlockingCollection before, but if I had known what it was for some of my earlier projects I might have used that. – Katana314 Jun 21 '13 at 20:05
  • Yea, it's really nice; it handles all of the thread safety stuff for you, and using `GetConsumingEnumerable()` makes it super simple to work with. One thing to be aware of: by default BlockingCollection uses a [ConcurrentQueue](http://msdn.microsoft.com/en-us/library/dd267265.aspx) for its internal storage; if you don't care about the order things come out of the collection, you can use a [different constructor](http://msdn.microsoft.com/en-us/library/dd287133.aspx), pass in a [ConcurrentBag](http://msdn.microsoft.com/en-us/library/dd381779.aspx), and have less resource contention. – Scott Chamberlain Jun 21 '13 at 20:07
  • Perfect! So my next issue is, I have a class that I am updating in each method, but I am not sure how to pass the class on to my "ProcessText" method. How would I handle that? Thanks btw, this is the first time I've heard of BlockingCollection as well :) – nGX Jun 21 '13 at 20:56
  • You can just use an anonymous delegate: `Parallel.ForEach(collection.GetConsumingEnumerable(), (text) => ProcessText(text, someOtherVariable));` – Scott Chamberlain Jun 21 '13 at 21:04
  • I have noticed that this process takes up an enormous amount of memory (around 600 MB), and for some reason it is also taking significantly longer than the original method. Any advice? – nGX Jun 21 '13 at 21:26
  • Pass a max size into the blocking collection constructor: `var collection = new BlockingCollection<string>(5);`. That would allow a max of 5 files to be held in memory at once. Also, you may want to limit the number of threads Parallel.ForEach starts up; I ran into that problem [myself](http://stackoverflow.com/questions/6977218/parallel-foreach-can-cause-a-out-of-memory-exception-if-working-with-a-enumera)! – Scott Chamberlain Jun 21 '13 at 22:29
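
Taken together, the two suggestions in the last comment might look like the sketch below; the capacity of 5 and the limit of 4 threads are illustrative numbers, not values from the original answer:

var collection = new BlockingCollection<string>(boundedCapacity: 5); // producer blocks once 5 files are waiting

Task producer = Task.Run(() =>
{
    foreach (var file in Directory.GetFiles(someSearch))
    {
        collection.Add(File.ReadAllText(file)); // blocks while the collection is full
    }
    collection.CompleteAdding();
});

var options = new ParallelOptions { MaxDegreeOfParallelism = 4 }; // cap the number of consumer threads
Parallel.ForEach(collection.GetConsumingEnumerable(), options, text => ProcessText(text));

The bounded capacity stops the producer from reading files faster than the consumers can process them, which is what drives the memory usage up.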

If you are doing this search regularly, consider indexing your files with a search engine such as Solr. Once the files are indexed, a search takes milliseconds.

You can also embed a search engine in your app, for example using the Lucene library.
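
For a rough idea of what the embedded option involves, here is a sketch using the Lucene.NET 4.8 API; the index path, field names, query text, and source folder are illustrative assumptions, not details from the original answer:

using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion version = LuceneVersion.LUCENE_48;
var folder = @"C:\files-to-search"; // illustrative source folder

// Build the index once (or whenever the files change).
using (var indexDir = FSDirectory.Open("search-index"))
using (var analyzer = new StandardAnalyzer(version))
using (var writer = new IndexWriter(indexDir, new IndexWriterConfig(version, analyzer)))
{
    foreach (var file in System.IO.Directory.GetFiles(folder))
    {
        var doc = new Document();
        doc.Add(new StringField("path", file, Field.Store.YES)); // stored so results can show the file name
        doc.Add(new TextField("content", System.IO.File.ReadAllText(file), Field.Store.NO)); // analyzed for full-text search
        writer.AddDocument(doc);
    }
    writer.Commit();
}

// Searching the index afterwards takes milliseconds instead of re-reading 4000+ files.
using (var indexDir = FSDirectory.Open("search-index"))
using (var reader = DirectoryReader.Open(indexDir))
using (var analyzer = new StandardAnalyzer(version))
{
    var searcher = new IndexSearcher(reader);
    var query = new QueryParser(version, "content", analyzer).Parse("needle");
    foreach (var hit in searcher.Search(query, 10).ScoreDocs)
    {
        Console.WriteLine(searcher.Doc(hit.Doc).Get("path"));
    }
}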

alex

The chances are that most of the time is spent waiting for the files to be read from the disk. In that situation, multithreading isn't going to help you a huge deal - rather than having one thread waiting for disk IO, you now have several threads waiting for disk IO.

Chris
  • Not necessarily; if you cache the file into RAM and the files are several megs in size, the act of searching can be significantly slower than reading it in. – Scott Chamberlain Jun 21 '13 at 19:33

The operation for this is largely going to be I/O bound, so parallel processing won't really give you any added performance. You can try indexing the files using a 3rd-party search library, but that's really all you can do as far as software goes. Splitting the files across multiple drives and using a different thread for each drive could help speed things up, if that's an option.

Nathan
  • For several megabyte files the search can be significantly slower than loading the file. – Scott Chamberlain Jun 21 '13 at 19:37
  • That seems unlikely given typical memory bandwidth vs. I/O speed. Most of the time waiting is probably spent waiting for the disk's seek head to reposition, given that several thousand files are probably not in contiguous space. Just my thoughts. – Nathan Jun 21 '13 at 19:43