2

I am currently developing an application that reads a text file of about 50000 lines. For each line, I need to check whether it contains a specific string.

At the moment, I use the conventional System.IO.StreamReader to read my file line by line.

The problem is that the size of the text file changes each time. I ran several performance tests and noticed that as the file size increases, each line takes longer to read.

For example :

Reading a txt file that contains 5000 lines : 0:40
Reading a txt file that contains 10000 lines : 2:54

It takes 4 times longer to read a file 2 times larger. I can't imagine how long it will take to read a 100000-line file.

Here's my code :

using (StreamReader streamReader = new StreamReader(this.MyPath))
{
    string line;

    while ((line = streamReader.ReadLine()) != null)
    {
        if (line.Contains(Resources.Constants.SpecificString))
        {
            // Do some action with the string.
        }
    }
}

Is there a way to avoid this situation, where a bigger file means more time to read a single line?

J. Steen
MHeads
  • `string.Contains` is not very efficient for this type of search. – leppie Apr 19 '13 at 13:34
  • Are you searching a few files, multiple times? Or Multiple files 1 time? – Dave Bish Apr 19 '13 at 13:35
  • Using a regular expression on the text file is much faster and much more efficient. – Izzy Apr 19 '13 at 13:35
  • A compiled RegEx should perform much better. – alexn Apr 19 '13 at 13:36
  • I am looking through one file only. – MHeads Apr 19 '13 at 13:36
  • 1
    I doubt that a RegEx is faster than a simple call to Contains. – Dirk Apr 19 '13 at 13:37
  • You've never used it. – Izzy Apr 19 '13 at 13:39
  • Did you try `string[] lines = File.ReadAllLines("file.txt");` and then looping through the array? I didn't try it, but maybe it's faster. – Davor Mlinaric Apr 19 '13 at 13:40
  • Yeah, I tried and it takes the same time. – MHeads Apr 19 '13 at 13:42
  • 5
    @DavorMlinaric `File.ReadAllLines` will read the whole thing into memory - not a desirable thing, if you wish to process things line-by-line. – Sergey Kalinichenko Apr 19 '13 at 13:43
  • 1
    The "Do some action" portion may be the culprit here. Did you try replacing it with a simple counter, and then comparing the timing of 5000 vs. 10000 lines? The time should grow linearly, unless you've got some disproportionally long strings in the second file. – Sergey Kalinichenko Apr 19 '13 at 13:45
  • @dasblinkenlight I use the same file for the timing tests; I only copied all of the 5000 rows twice. The time increases exponentially, and this is my problem. – MHeads Apr 19 '13 at 13:47
  • 3
    I doubt if you keep doubling the size of the file you'll see an exponential growth in the time taken. It *should* be approximately O(N), since only linear operations are being done. I've no idea why the big slowdown when doubling the size of the file - does it contain similar data to the first file? What is the "DoSomeAction()"? I agree with dasblinkenlight - it might be the culprit. – Matthew Watson Apr 19 '13 at 13:48
  • Have you looked at [this](http://stackoverflow.com/questions/2161895/reading-large-text-files-with-streams-in-c-sharp)? He is reading much larger files than you. He is suggesting a `BufferedStream` to improve the read speed. – Arion Apr 19 '13 at 13:50
  • That is what I thought too. I find it strange that the reading time is exponential. The "do some action" is only a piece of code that searches another .csv file for the corresponding line. If the line is found, I add it to a DataTable. – MHeads Apr 19 '13 at 13:53
  • @Sebastien The time does not look exponential, it looks quadratic. Try 15000 lines, and see if it's nine times the 5000-line's time to prove this suggestion. The next thing to try would be shortening the line for which you are searching (i.e. the `SpecificString`) to a single character, and trying this experiment again. The timing of `Contains` is `O(m*n)`, where `m` is the length of the line of text, and `n` is the length of the string being searched. – Sergey Kalinichenko Apr 19 '13 at 13:59

2 Answers

7

Try this:

var toSearch = Resources.Constants.SpecificString;
foreach (var str in File.ReadLines(MyPath).Where(s => s.Contains(toSearch))) {
    // Do some action with the string
}

This avoids accessing the resources on each iteration by caching the value before the loop. If this does not help, try writing your own Contains based on an advanced string-searching algorithm, such as Knuth-Morris-Pratt (KMP).
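For illustration, here is a minimal sketch of the KMP idea: a substring test that runs in O(n + m) time rather than the O(n * m) worst case of a naive scan. The `KmpSearch` class name and the demo strings in `Main` are made up for this example.

```csharp
using System;

class KmpSearch
{
    // Failure table: table[i] is the length of the longest proper prefix
    // of pattern[0..i] that is also a suffix of it.
    static int[] BuildTable(string pattern)
    {
        var table = new int[pattern.Length];
        int k = 0;
        for (int i = 1; i < pattern.Length; i++)
        {
            while (k > 0 && pattern[i] != pattern[k]) k = table[k - 1];
            if (pattern[i] == pattern[k]) k++;
            table[i] = k;
        }
        return table;
    }

    // True if pattern occurs in text. On a mismatch the table tells us how
    // far the pattern can be shifted without re-reading text characters.
    public static bool Contains(string text, string pattern)
    {
        if (pattern.Length == 0) return true;
        int[] table = BuildTable(pattern);
        int k = 0;
        foreach (char c in text)
        {
            while (k > 0 && c != pattern[k]) k = table[k - 1];
            if (c == pattern[k]) k++;
            if (k == pattern.Length) return true;
        }
        return false;
    }

    static void Main()
    {
        Console.WriteLine(Contains("abababca", "ababca")); // True
        Console.WriteLine(Contains("abababca", "abd"));    // False
    }
}
```

Whether this beats `string.Contains` in practice depends on the data; for short search strings the built-in method is usually hard to beat.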


Note: be sure to use File.ReadLines, which reads lines lazily (unlike the similar-looking File.ReadAllLines, which reads all lines at once).
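The difference is easy to see in a small sketch; the sample file name and its contents here are made up for the demo:

```csharp
using System;
using System.IO;
using System.Linq;

class LazyReadDemo
{
    static void Main()
    {
        // Hypothetical sample file for the demo.
        string path = "sample.txt";
        File.WriteAllLines(path, new[] { "alpha", "beta MATCH", "gamma", "delta MATCH" });

        // File.ReadLines yields lines one at a time, so memory use stays
        // flat however large the file is; File.ReadAllLines would load
        // every line into an array before the loop even starts.
        int hits = File.ReadLines(path).Count(line => line.Contains("MATCH"));
        Console.WriteLine(hits); // prints 2
    }
}
```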

Alexei Levenkov
Sergey Kalinichenko
  • Using a variable to store the search term will improve performance, but using a LINQ query instead of a simple if statement will decrease it, because lambda functions are quite expensive. – Dirk Apr 19 '13 at 13:48
  • 2
    @Dirk Microsoft's team optimized the heck out of lambdas, so they are not expensive at all. – Sergey Kalinichenko Apr 19 '13 at 13:56
0

Use Regex.IsMatch and you should see some performance improvements.

using System.Text.RegularExpressions;

using (StreamReader streamReader = new StreamReader(this.MyPath))
{
    var regEx = new Regex(MyPattern, RegexOptions.Compiled);

    string line;
    while ((line = streamReader.ReadLine()) != null)
    {
        if (regEx.IsMatch(line))
        {
            // Do some action with the string.
        }
    }
}

Please remember to use a compiled RegEx, however. Here's a pretty good article with some benchmarks you can look at.
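One caveat, assuming the search target is a literal string rather than a real pattern: escape it with Regex.Escape, so any metacharacters in the text are matched literally. A small sketch (the literal value here is made up):

```csharp
using System;
using System.Text.RegularExpressions;

class EscapeDemo
{
    static void Main()
    {
        // "a.b" contains '.', a regex metacharacter; escaping makes the
        // pattern match the literal text instead of "a<any char>b".
        string literal = "a.b";
        var regex = new Regex(Regex.Escape(literal), RegexOptions.Compiled);

        Console.WriteLine(regex.IsMatch("found a.b here")); // True
        Console.WriteLine(regex.IsMatch("found aXb here")); // False
    }
}
```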

Happy coding!

Uwe Keim
elucid8