5

I'm writing a program to search for a keyword inside thousands of files. Each file contains lines that I need to ignore because they distort the results. Luckily, those lines always come after a specific line in each file.
What I have so far is a search that does not yet ignore the lines after that specific line. It returns an enumerable of the names of the files containing the keyword.

var searchResults = files.Where(file => File.ReadLines(file.FullName)
                                            .Any(line => line.Contains(keyWord)))
                                            .Select(file => file.FullName);

Is there a simple and fast way to implement this? It doesn't necessarily have to use LINQ, as I'm not even sure that would be possible.

Edit:
An example to make it clearer. This is how the text files are structured:
xxx
xxx
string
yyy
yyy

I want to search the xxx lines until either the keyword or the "string" line is found, and then skip to the next file. The yyy lines should be ignored in the search.

drouning
  • My main problem is that I don't know how to ignore the lines after the "string". Searching the "yyy" lines gives too many false positives in the results. – drouning Jul 30 '15 at 09:31
  • Have you looked at this question? http://stackoverflow.com/questions/31717324/searching-in-text-files-until-specific-string – Enigmativity Jul 30 '15 at 09:58

5 Answers

4

Try this:

var searchResults = files.Where(file => File.ReadLines(file.FullName)
                                            .TakeWhile(line => line != "STOP")
                                            .Any(line => line.Contains(keyWord)))
                                            .Select(file => file.FullName);
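
To see why this works, here is a minimal in-memory sketch of the same TakeWhile/Any combination. The class and method names (TakeWhileDemo, ContainsBeforeStop) are made up for illustration, and "STOP" stands in for whatever the real delimiter line is. Note that the comparison is an exact, case-sensitive match on the whole line, so a delimiter with trailing whitespace would not be recognised.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TakeWhileDemo
{
    // Hypothetical helper: true if keyWord occurs on a line that comes
    // before the first line equal to stopLine.
    public static bool ContainsBeforeStop(
        IEnumerable<string> lines, string keyWord, string stopLine)
    {
        return lines
            .TakeWhile(line => line != stopLine)   // stop enumerating at the delimiter line
            .Any(line => line.Contains(keyWord));  // stop at the first matching line
    }
}
```

For example, ContainsBeforeStop(new[] { "xxx", "STOP", "yyy key" }, "key", "STOP") is false because the only match comes after the delimiter, while the same call on { "xxx key", "STOP", "yyy" } is true. If a file has no delimiter line at all, every line is searched.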
Ghasan غسان
1

You can process the files in parallel: just add AsParallel() after "files". This should speed up processing. ReadLines does not read the whole file before searching, so it should work as you expect.

EDIT: Sorry, I misread your question the first time and didn't notice the stop word. Given that, I think it would be easier to avoid LINQ:

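        // Needs: using System.Collections.Concurrent; using System.IO; using System.Threading.Tasks;
        // Parallel.ForEach runs the body concurrently. A plain foreach over
        // files.AsParallel() would still execute the body on a single thread.
        var result = new ConcurrentBag<string>();
        Parallel.ForEach(files, file =>
        {
            foreach (string line in File.ReadLines(file.FullName))
            {
                if (line.Contains(keyWord))
                {
                    // Keyword found before the stop word: record the file and move on.
                    result.Add(file.FullName);
                    break;
                }
                else if (line.Contains(stopWord))
                {
                    // Stop word reached: ignore the rest of this file.
                    break;
                }
            }
        });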
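
The AsParallel() idea from the first paragraph can also be combined with the stop word in a single PLINQ query. The following is only a sketch: the class and method names (ParallelSearch, FindFiles) are invented for illustration, and the stop line is detected with Contains, as in the loop.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ParallelSearch
{
    // Hypothetical helper: full names of the files in which keyWord occurs
    // on a line before the first line containing stopWord.
    public static List<string> FindFiles(
        IEnumerable<FileInfo> files, string keyWord, string stopWord)
    {
        return files
            .AsParallel()                                   // let PLINQ spread files across threads
            .Where(file => File.ReadLines(file.FullName)    // lazy, line-by-line read
                               .TakeWhile(line => !line.Contains(stopWord))
                               .Any(line => line.Contains(keyWord)))
            .Select(file => file.FullName)
            .ToList();
    }
}
```

Two differences from the loop are worth keeping in mind. A line that contains both the keyword and the stop word counts as a match in the loop, but not here, because TakeWhile drops the stop line before Any sees it. Also, the order of the returned names is not guaranteed unless AsOrdered() is added after AsParallel().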
sarh
1

It's only a minor modification: read lines only up to the delimiter line, and stop at the first occurrence of the keyword:

var searchResults = files.Where(file => File.ReadLines(file.FullName)
                                            .TakeWhile(line => line != myString)
                                            .Any(line => line.IndexOf(keyWord) > -1)
                               )
                         .Select(file => file.FullName);
Gert Arnold
  • That does help, but what about the cases where the keyword is only present in the "yyy" lines? That would still result in a few false positives. – drouning Jul 30 '15 at 09:43
  • Not completely. I think your code assumes that "myString" is always present in the lines I want to ignore, but that's not the case. "myString" is a single line in the files that serves as a delimiter. I'd like to ignore every line after "myString", even if it contains the keyword. – drouning Jul 30 '15 at 09:58
  • OK, so `myString` is the complete line after which you want to stop searching and `keyWord` is the word you're looking for. – Gert Arnold Jul 30 '15 at 10:22
  • I now see that Ghasan has essentially the same solution. Give it to him if it suits you. – Gert Arnold Jul 30 '15 at 10:37
0

If you want to remove a specific string from a fairly large string, I suggest you look at the link below:

Fastest way to remove chars from string

Edit: As per the new content in your question

My approach is a little primitive but fairly effective:

string FileString = "Your String to search from";
int LastIndexToRead = FileString.IndexOf("Your Specific String");
// IndexOf returns -1 when the specific string is absent; keep the whole text in that case.
string NewStr = LastIndexToRead >= 0 ? FileString.Substring(0, LastIndexToRead) : FileString;

If your file is much bigger, I suggest breaking the string into multiple pieces for better performance.


Hope it helps

Community
Developer Nation
0

You might be able to do something with the IEnumerable<string> that ReadLines returns.

If the lines you can ignore in each file come after a specific line number, you can cut them from the enumerable with Take(n). This works lazily, so no ToList() is needed first.

If the placement of the section to ignore is dynamic, then presumably you can identify it from a header string or similar?

If so, your best bet will likely be to:

  • Open the file
    • Read line by line (manually)
      • Look for the "Skip from here" string
        • Skip the rest of this file
      • Look for a string matching the search keyword
        • Record the file as matching
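
The steps above could be sketched roughly as follows. The names (ManualSearch, FileMatches, skipFromHere) are placeholders rather than anything from the question, and the sketch assumes the "Skip from here" marker is a complete line that can be compared exactly.

```csharp
using System;
using System.IO;

public static class ManualSearch
{
    // Hypothetical helper: true if keyWord appears on a line before the
    // first line equal to skipFromHere.
    public static bool FileMatches(string path, string keyWord, string skipFromHere)
    {
        using (var reader = new StreamReader(path))      // open the file
        {
            string line;
            while ((line = reader.ReadLine()) != null)   // read line by line
            {
                if (line == skipFromHere)
                {
                    return false;                        // skip the rest of this file
                }
                if (line.Contains(keyWord))
                {
                    return true;                         // record the file as matching
                }
            }
        }
        return false;                                    // end of file, no match
    }
}
```

Calling FileMatches once per file and collecting the paths for which it returns true gives the list of matching files. Because the reader is disposed as soon as the method returns, the rest of the file is never read.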
xan
  • The placement is dynamic, but it's always the same string, so it can easily be identified. What you wrote is exactly what I want to do, but is that possible in LINQ? – drouning Jul 30 '15 at 09:37