
I've been noticing that the following segment of code does not scale well for large files (I think that appending to the paneContent string is slow):

            string paneContent = String.Empty;
            bool lineFound = false;
            foreach (string line in File.ReadAllLines(path))
            {
                if (line.Contains(tag))
                {
                    lineFound = !lineFound;
                }
                else
                {
                    if (lineFound)
                    {
                        paneContent += line;
                    }
                }
            }
            using (TextReader reader = new StringReader(paneContent))
            {
                data = (PaneData)(serializer.Deserialize(reader));
            }

What's the best way to speed this all up? I have a file that looks like this (so I want to get all the content between the two tag lines and then deserialize that content):

A line with some tag 
A line with content I want to get into a single stream or string
A line with content I want to get into a single stream or string
A line with content I want to get into a single stream or string
A line with content I want to get into a single stream or string
A line with content I want to get into a single stream or string
A line with some tag

Note: These tags are not XML tags.

Alexandru
  • You could start by determining which part of your code takes a long time to run for bigger files. If it's the ReadAllLines call, you could just open the file in read mode, read each line until it matches the first tag line, collect the rows you want, and close the file when you get to the closing tag line. That way you would not read the whole file if it's not required (see the sketch after these comments). – Dany Gauthier Jun 04 '13 at 20:46
  • How would you use a regex here? – Alexandru Jun 04 '13 at 20:47
  • @Alexandru This is how I would use a regex: http://stackoverflow.com/questions/6560672/java-regex-to-extract-text-between-tags – Josh C. Jun 04 '13 at 20:52
  • Regex would be good for a string but all of this data is in a file. – Alexandru Jun 04 '13 at 20:54
  • If it is not too much to load the file in memory, I would load it and run the regex. http://stackoverflow.com/questions/4195540/c-read-data-from-txt-file – Josh C. Jun 04 '13 at 20:56
  • If it is too expensive to load the whole file, but you know the tags will never be broken across lines, you should use the string builder. I would still use a regex to trigger the begin and end of the string building. – Josh C. Jun 04 '13 at 20:58
  • Also, this post answers using a regex against a large file: http://stackoverflow.com/questions/9546273/python-3-searching-a-large-text-file-with-regex – Josh C. Jun 04 '13 at 20:59
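
A minimal sketch of the early-exit idea from the comments above, assuming the tag appears exactly twice and never spans a line break (path and tag stand in for the caller's values):

    using System.IO;
    using System.Text;

    static string ReadBetweenTags(string path, string tag)
    {
        var paneContent = new StringBuilder();
        bool insideTags = false;

        foreach (string line in File.ReadLines(path)) // streams lines lazily
        {
            if (line.Contains(tag))
            {
                if (insideTags)
                    break;              // second tag reached: stop reading the file
                insideTags = true;      // first tag reached: start collecting
                continue;
            }

            if (insideTags)
                paneContent.Append(line);
        }

        return paneContent.ToString();
    }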

3 Answers


You could use a StringBuilder as opposed to a string; that is what StringBuilder is for. Some example code is below:

var paneContent = new StringBuilder();
bool lineFound = false;
foreach (string line in File.ReadLines(path))
{
    if (line.Contains(tag))
    {
        // Toggle on at the opening tag, off at the closing tag
        lineFound = !lineFound;
    }
    else
    {
        if (lineFound)
        {
            // Only collect lines that sit between the two tags
            paneContent.Append(line);
        }
    }
}
using (TextReader reader = new StringReader(paneContent.ToString()))
{
    data = (PaneData)(serializer.Deserialize(reader));
}

As mentioned in this answer, a StringBuilder is preferred to a string when you are concatenating in a loop, which is the case here.

JMK

Here is an example of how to use groups with regexes and retrieve their contents afterwards.

What you want is a regex that will match your tags, capture the content between them as a group, and then retrieve the data of that group, as in the example.
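
A minimal sketch of that approach, assuming the whole file fits in memory and the tags never span a line break (path and tag are placeholders):

    using System.IO;
    using System.Text.RegularExpressions;

    static string ExtractBetweenTags(string path, string tag)
    {
        string text = File.ReadAllText(path);

        // Capture everything between the first two occurrences of the tag into
        // a named group; Singleline lets '.' match newlines as well.
        string pattern = Regex.Escape(tag) + "(?<body>.*?)" + Regex.Escape(tag);
        Match match = Regex.Match(text, pattern, RegexOptions.Singleline);

        return match.Success ? match.Groups["body"].Value : string.Empty;
    }

The captured group can then be handed to the serializer through a StringReader, just as in the question.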

Eric

Use a StringBuilder to build your data string (paneContent). It's much faster because concatenating strings allocates a brand-new string each time, while a StringBuilder appends into a pre-allocated buffer (if you expect large data strings, you can customize the initial allocation).
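
For example (the capacity below is only an illustrative guess at the expected size):

    using System.Text;

    // Reserve roughly the expected number of characters up front so the buffer
    // doesn't have to grow repeatedly; 64 * 1024 is just a placeholder figure.
    var paneContent = new StringBuilder(64 * 1024);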

It's a good idea to read your input file line-by-line so you can avoid loading the whole file into memory if you expect files with many lines of text.
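
Concretely, File.ReadLines streams the lines on demand, while File.ReadAllLines materializes the whole file as an array first (path is a placeholder here):

    using System.Collections.Generic;
    using System.IO;

    string path = "input.txt";                        // placeholder path

    string[] eager = File.ReadAllLines(path);         // whole file read into an array up front
    IEnumerable<string> lazy = File.ReadLines(path);  // lines yielded one at a time as you iterate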

xxbbcc