Read a text file and search for a string in a memory-efficient way (and abort when found)

Question

I'm searching for a string in a text file (this also includes XML files). This is what I thought of first:

using (StreamReader sr = File.OpenText(fileName))
{
    string s = String.Empty;
    while ((s = sr.ReadLine()) != null)
    {
        if (s.Contains("mySpecialString"))
            return true;
    }
}

return false;

I want to read line by line to minimize the amount of RAM used. Once the string has been found, the operation should abort. The reason I don't process the file as XML is that it would have to be parsed, which would also consume more memory than necessary.

Another easy implementation would be

bool found = File.ReadAllText(path).Contains("mySpecialString");

but that would read the complete file into memory, which isn't what I want. On the other hand, it might be faster.

Another option would be this:

foreach (string line in File.ReadLines(path))
{
    if (line.Contains("mySpecialString"))
    {
        return true;
    }
}
return false;

But which of these (or perhaps another approach of yours) is the most memory efficient?

score 3 · Accepted Answer · answered May 06 '15 at 13:38

You can use a query with File.ReadLines, so it only reads as many lines as it needs to, in order to satisfy your query. The Any() method will stop when it hits a line containing your string.

return File.ReadLines(fileName).Any(line => line.Contains("mySpecialString"));
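
For completeness, here is a minimal self-contained sketch of the same idea, wrapped in a method and with the comparison mode made explicit. The names `LineSearch` and `FileContainsLine` are made up for illustration. `IndexOf` with a `StringComparison` is used because it is available on all framework versions; `line.Contains(text)` is an ordinal comparison, which matches the default chosen here.

```csharp
using System;
using System.IO;
using System.Linq;

public static class LineSearch
{
    // Hypothetical helper: true if any line of the file contains 'text'.
    // File.ReadLines enumerates lazily and Any() stops at the first matching line,
    // so at most one line is held in memory at a time.
    public static bool FileContainsLine(string path, string text,
        StringComparison comparison = StringComparison.Ordinal)
    {
        return File.ReadLines(path)
                   .Any(line => line.IndexOf(text, comparison) >= 0);
    }
}
```

A case-insensitive search would then be `LineSearch.FileContainsLine(fileName, "mySpecialString", StringComparison.OrdinalIgnoreCase)`.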

answered May 06 '15 at 13:38

Grant Winney


Wow. That solution looks nice and simple! – testing May 06 '15 at 13:45

score 2 · Answer 2 · edited May 23 '17 at 11:43

I also prefer the accepted answer. Maybe I'm micro-optimizing here, but you asked for a memory-efficient approach. Also consider that the text you are searching for could itself contain new-line characters such as '\r', '\n' or "\r\n", and that a large file could theoretically consist of a single line, which negates the benefit of ReadLines.

So you could use this method:

public static bool FileContainsString(string path, string str, bool caseSensitive = true)
{
    if (String.IsNullOrEmpty(str))
        return false;

    using (var stream = new StreamReader(path))
    while (!stream.EndOfStream)
    {
        bool stringFound = true;
        for (int i = 0; i < str.Length; i++)
        {
            char strChar = caseSensitive ? str[i] : Char.ToUpperInvariant(str[i]);
            char fileChar = caseSensitive ? (char)stream.Read() : Char.ToUpperInvariant((char)stream.Read());
            if (strChar != fileChar)
            {
                stringFound = false;
                break; // break for-loop, start again with first character at next position
            }
        }
        if (stringFound)
            return true;
    }
    return false;
}

bool containsString = FileContainsString(path, "mySpecialString", false); // ignore case if desired

Note that this might be the most efficient approach and, hidden in a method, it is also readable. It has one drawback, though: a culture-sensitive comparison is not feasible, because it looks at single characters rather than at substrings.

So you have to keep some edge cases in mind where you can run into issues, like the famous Turkish i example or surrogate pairs.

Thanks for your addition. Could you elaborate on the last point a bit more? In which cases does it work, and in which does it not? – testing May 07 '15 at 06:21
@testing: for example the famous [Turkish `i` example](https://msdn.microsoft.com/en-us/library/ms994325.aspx#cltsafcode_topic4). Or [surrogate pairs](https://msdn.microsoft.com/en-us/library/vstudio/8k5611at%28v=vs.100%29.aspx), which consist of two characters. – Tim Schmelter May 07 '15 at 07:41
But those are edge cases; in most cases it's not an issue. You just have to keep it in mind. – Tim Schmelter May 07 '15 at 07:55
There is a non-obvious flaw here! It's a little difficult to explain, but if only the first few chars match and the loop breaks out, a match that begins within the characters already read gets skipped over. Does that make sense? – colin lamarre Jan 11 '20 at 03:43
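
The flaw described in the last comment shows up with, for example, the search string "aab" and the file content "aaab": the method reads "aaa", fails on the third character, and resumes at "b", so the match is missed. One possible way around this, which is not what the answer's code does, is to swap in a Knuth-Morris-Pratt style failure table so that no character has to be re-read. The following is a minimal sketch under that assumption; `StreamSearch` and `BuildFallbackTable` are names made up for illustration, and the comparison is still per-character (ordinal or invariant upper-case), so the culture and surrogate-pair caveats above remain. It also loops on the return value of `Read()` (which is -1 at the end of the stream) instead of casting it to `char` unchecked.

```csharp
using System;
using System.IO;

public static class StreamSearch
{
    // Returns true if 'str' occurs anywhere in the file.
    // Memory use is one int per pattern character plus the reader's buffer.
    public static bool FileContainsString(string path, string str, bool caseSensitive = true)
    {
        if (string.IsNullOrEmpty(str))
            return false;

        // Normalize the pattern the same way each file character is normalized.
        char[] pattern = str.ToCharArray();
        if (!caseSensitive)
        {
            for (int i = 0; i < pattern.Length; i++)
                pattern[i] = char.ToUpperInvariant(pattern[i]);
        }

        int[] fallback = BuildFallbackTable(pattern);

        using (var reader = new StreamReader(path))
        {
            int matched = 0; // number of pattern chars matched so far
            int next;
            while ((next = reader.Read()) != -1)
            {
                char c = caseSensitive ? (char)next : char.ToUpperInvariant((char)next);

                // On a mismatch, fall back to the longest prefix that is still a valid match.
                while (matched > 0 && c != pattern[matched])
                    matched = fallback[matched - 1];

                if (c == pattern[matched])
                    matched++;

                if (matched == pattern.Length)
                    return true;
            }
        }
        return false;
    }

    // fallback[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of pattern[0..i].
    private static int[] BuildFallbackTable(char[] pattern)
    {
        var table = new int[pattern.Length];
        int k = 0;
        for (int i = 1; i < pattern.Length; i++)
        {
            while (k > 0 && pattern[i] != pattern[k])
                k = table[k - 1];
            if (pattern[i] == pattern[k])
                k++;
            table[i] = k;
        }
        return table;
    }
}
```

The call site stays the same as in the answer: `StreamSearch.FileContainsString(path, "mySpecialString", false)`.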

score 1 · Answer 3 · answered May 06 '15 at 13:40

I think both of your solutions are equivalent. See MSDN: https://msdn.microsoft.com/en-us/library/dd383503%28v=vs.110%29.aspx

There it says: "The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned"

The same article also suggests that ReadLines should be used in conjunction with LINQ to Objects.
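
The difference quoted above can be sketched side by side. This is only an illustration; the class and method names are made up. Both methods return the same result, but they differ in how much of the file is held in memory at once.

```csharp
using System.IO;
using System.Linq;

public static class LazyVersusEager
{
    // Lazy: lines are read one at a time and enumeration stops at the first hit.
    public static bool ContainsLazy(string path, string text)
    {
        return File.ReadLines(path).Any(line => line.Contains(text));
    }

    // Eager: ReadAllLines builds a string[] with every line of the file
    // before Any() looks at the first element.
    public static bool ContainsEager(string path, string text)
    {
        return File.ReadAllLines(path).Any(line => line.Contains(text));
    }
}
```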

answered May 06 '15 at 13:40

Zoran Horvat


Which *both* solutions do you mean? I also read about `ReadAllLines`, but I didn't use it in my examples. So `ReadLines` seems the way to go. Thanks. – testing May 06 '15 at 13:44

Ah yes, I overlooked that. ReadLines should be the same as the first solution with ReadLine. ReadAllLines is less efficient because it reads the complete file. – Zoran Horvat May 06 '15 at 13:51
