
I'm trying to think of the most efficient way to search a directory full of text files (possibly 2000 files of around 150 lines each) for a keyword. If I were searching for just one keyword, performance wouldn't be much of an issue, but in my application I want to be able to search for a different keyword at a later point, possibly multiple times. Iterating over the entire file collection each time seems time-consuming, and storing everything in memory seems quite memory-expensive too.

What would be the best way to do this? I don't have access to an SQL database or anything like that, so I can't temporarily dump the contents into a database and search that periodically; it's just going to be a regular Windows application.

The most primitive approach I can think of is to dump all of the files into one huge XML file and search that - rather than iterating through all of the files in the directory each time a keyword search happens. But even this seems like it could be quite time intensive?

I will know the directory name in advance, so I can pre-process the contents if that would help with optimisation.

Any suggestions are welcome, thanks.

theqs1000
  • Since you don't want to use a database, how about building your own inverted index? I guess you want some kind of full-text search? – DerApe Nov 05 '12 at 13:57
  • "And storing everything in memory seems quite memory expensive too." - How about doing a run and counting how much data we're actually talking about? – NPSF3000 Nov 05 '12 at 13:57
  • Can the contents of the folder change once the program has started, or can the program build an index at startup and rely on it? – Frederiek Nov 05 '12 at 13:57
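
The hand-built inverted index suggested in the comments above could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name `InvertedIndexSketch`, its methods, the `*.txt` file pattern, and the simple punctuation-based word splitting are all hypothetical choices, not something from the thread. It maps each word to the set of files containing it, so each later keyword search is a dictionary lookup rather than a pass over every file. It matches whole words only, case-insensitively.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper: a minimal in-memory inverted index (word -> file names).
public sealed class InvertedIndexSketch
{
    // Case-insensitive lookup, so "Error" and "error" share one entry.
    private readonly Dictionary<string, HashSet<string>> postings =
        new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);

    // Assumed tokenisation: split on whitespace and common punctuation.
    private static readonly char[] Separators =
        " \t\r\n.,;:!?()[]{}\"'".ToCharArray();

    // Records every word of `text` as occurring in the document `name`.
    public void AddDocument(string name, string text)
    {
        foreach (var word in text.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            HashSet<string> files;
            if (!postings.TryGetValue(word, out files))
            {
                files = new HashSet<string>();
                postings[word] = files;
            }
            files.Add(name);
        }
    }

    // Indexes every *.txt file directly inside `directory` (pattern is an assumption).
    public void AddDirectory(string directory)
    {
        foreach (var path in Directory.EnumerateFiles(directory, "*.txt"))
        {
            AddDocument(path, File.ReadAllText(path));
        }
    }

    // Returns the names of all documents containing `keyword` as a whole word.
    public IEnumerable<string> Search(string keyword)
    {
        HashSet<string> files;
        return postings.TryGetValue(keyword, out files)
            ? files
            : Enumerable.Empty<string>();
    }
}
```

Building the index once with `AddDirectory` and rebuilding it periodically would cover the case where files change while the program is running. The index stores each distinct word once plus a set of file names, so it is usually much smaller than the raw text.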

2 Answers


Why not use a command-line utility that you call from C#?

The findstr utility on the Windows command line can do what you need, and it is efficient: http://technet.microsoft.com/en-us/library/bb490907.aspx

For how to call it from C#, see "How To: Execute command line in C#, get STD OUT results".
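
As a minimal sketch of that call, assuming the files end in `.txt` and the keyword contains no double quotes: the class and method names here (`FindStrSketch`, `BuildArguments`, `FilesContaining`) are made up for illustration. The findstr switches used are `/M` (print only the names of matching files), `/I` (ignore case), `/L` (treat the search string literally), and `/C:` (the search string).

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper that shells out to findstr and collects matching file names.
public static class FindStrSketch
{
    // Builds the findstr argument string; assumes `keyword` has no double quotes.
    public static string BuildArguments(string directory, string keyword)
    {
        string pattern = directory.TrimEnd('\\') + "\\*.txt";
        return "/M /I /L /C:\"" + keyword + "\" \"" + pattern + "\"";
    }

    // Runs findstr and returns one entry per file that contains `keyword`.
    public static List<string> FilesContaining(string directory, string keyword)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "findstr",
            Arguments = BuildArguments(directory, keyword),
            UseShellExecute = false,        // required to redirect output
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            // Read all output before waiting, to avoid blocking on a full pipe.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();

            var lines = output.Split(new[] { '\r', '\n' },
                                     StringSplitOptions.RemoveEmptyEntries);
            return new List<string>(lines);
        }
    }
}
```

Note that this still scans every file on each search; it only moves the scanning into findstr. Adding `/S` would also search subdirectories.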

Good luck!

Roy Dictus
  • Thanks for your responses. Yes, I would require a full text search. It is possible that changes could be made in the directory once the program is started - although I suppose it wouldn't be out of the question to periodically rebuild any index. – theqs1000 Nov 05 '12 at 14:05
  • Thanks - between findstr, which it appears I can run over a whole directory, and Lucene which was linked to above I think I've definitely been pushed in the right direction. Thanks again for your responses. – theqs1000 Nov 05 '12 at 14:12

As "L.B" stated, you can use Lucene.Net to create an inverted index. It is a .NET port of the Java library Lucene (see Lucene on apache.org).

This is a small example of how to do it.

DerApe