
I want to search for a string (every 5 to 10 minutes) in text files, or in a folder of text files, that might total around 500 MB. I would like to know which technique would be most effective for this kind of search. Please let me know if anything is unclear. I am using C# (.NET).

Thanks & regards

Marxist Ali
  • The text file is 1 GB and you want to search it every 5 or 10 minutes? How much of the file? I'm thinking there might be a better way to do what you're doing... – Ste May 03 '12 at 10:05
  • Boyer-Moore should give you the best results if the string is 'fixed' (i.e. not a regex pattern). – leppie May 03 '12 at 10:07
  • What are the other characteristics of the file? Is it sorted? I guess not, since you want to search it, but is it partially sorted? Does the string you are searching for remain constant, or do you search for a different string every 5/10 minutes? Tell us everything. – High Performance Mark May 03 '12 at 10:09
  • The file is not sorted. And yes, the string will be the same for every search (every 5 to 10 minutes). – Marxist Ali May 03 '12 at 10:11
  • @Ste: Searching a 10 GB file should take no more than a few seconds if the content is cached, in memory, or on an SSD. – leppie May 03 '12 at 10:20
  • @leppie Possibly but I'd still like to know the context so we can be sure it's the best way. :) – Ste May 03 '12 at 10:22

2 Answers


The first thing to do is to write something that achieves your desired result.

Then use a profiler to determine what is taking the longest time!

Once you've found the bit that takes the longest time, see if there's any way to improve that.

Now, from your question, my guess is that the part that takes the longest will be transferring the data from the hard drive into RAM. If the file contains different data each time you search it, then that transfer sets the upper limit on how fast the search can be. If the file does not change, then there are a few possibilities for improving on that limit.

But first, find out what's taking the time.
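
As a starting point for that measurement, here is a minimal baseline sketch, not a tuned implementation. It assumes a single text file, an ordinal (case-sensitive) match, and that a match never spans a line break; the class and method names (`BaselineSearch`, `CountMatchingLines`) and the command-line arguments are made up for illustration. It streams the file with `File.ReadLines` and times the whole scan with `Stopwatch`, which is enough to see how long a brute-force pass really takes before looking at anything smarter.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class BaselineSearch
{
    // Counts the lines that contain `needle` (ordinal, case-sensitive comparison).
    // Assumption: a match never spans a line break.
    public static long CountMatchingLines(string path, string needle)
    {
        long hits = 0;

        // File.ReadLines reads lazily, so the whole file is never held in memory at once.
        foreach (string line in File.ReadLines(path))
        {
            if (line.IndexOf(needle, StringComparison.Ordinal) >= 0)
            {
                hits++;
            }
        }

        return hits;
    }

    private static void Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.WriteLine("usage: BaselineSearch <file> <search string>");
            return;
        }

        string path = args[0];   // hypothetical: the file to scan
        string needle = args[1]; // hypothetical: the fixed search string

        // Time the complete scan (disk read plus comparison).
        Stopwatch timer = Stopwatch.StartNew();
        long hits = CountMatchingLines(path, needle);
        timer.Stop();

        Console.WriteLine("{0} matching line(s) in {1} ms", hits, timer.ElapsedMilliseconds);
    }
}
```

Running it twice in a row on the same file is a cheap first experiment: if the second run is much faster, the first run was probably dominated by disk reads rather than by the comparison, although a profiler will give a more reliable answer.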

Skizz
  • Wow, this is really doing things the hard way! Time would be much better spent reading up on string searching algorithms. – leppie May 03 '12 at 10:23
  • @leppie: My point is that the difference between brute force and smart searching will be dwarfed by the time taken to load the data off the disk, and profiling would highlight this. Profiling is essential; without it you're just stumbling around in the dark. There's no point spending time researching string searching algorithms if all your run-time is spent just loading the data. – Skizz May 03 '12 at 10:45
  • "write something that achieves your desired result" - isn't the OP asking just that? – thewpfguy May 03 '12 at 11:15
  • @thewpfguy: well, the OP is asking "which technique would be fruitful". This implies that the OP is aware of many techniques and wishes to choose between them before writing any code. But Skizz is right: just writing something (anything) that works is not a major task, and it wouldn't be a bad idea for the OP to approach the choice of technique from a starting point of having one technique that is implemented and working. If he writes something very simple, and it turns out that it only takes a few seconds to scan the data, he could save hours of time choosing between KMP or Boyer-Moore. – Steve Jessop May 03 '12 at 11:36
  • ... time which can only possibly improve the runtime of the program by a few seconds (the time spent scanning the data), when a profiler might have told him that the slow part of the program is the disk I/O, and that he should have spent time reading about that instead of reading about string searching. – Steve Jessop May 03 '12 at 11:39

You can use the Windows Desktop Search API (see here) to perform your search.

Vincent Vancalbergh