How to speed up this code?

Question

I got the following method which is used to read a txt file and return a dictionary. It takes ~7 mins to read a ~5MB file (67000 lines, 70 chars in each line).

public static Dictionary<string, string> FASTAFileReadIn(string file)
{
    Dictionary<string, string> seq = new Dictionary<string, string>();

    Regex re;
    Match m;
    GroupCollection group;
    string currentName = string.Empty;

    try
    {
        using (StreamReader sr = new StreamReader(file))
        {
            string line = string.Empty;
            while ((line = sr.ReadLine()) != null)
            {
                if (line.StartsWith(">"))
                {// Match Sequence
                    re = new Regex(@"^>(\S+)");
                    m = re.Match(line);
                    if (m.Success)
                    {
                        group = m.Groups;
                        if (!seq.ContainsKey(group[1].Value))
                        {
                            seq.Add(group[1].Value, string.Empty);
                            currentName = group[1].Value;
                        }
                    }
                }
                else if (Regex.Match(line.Trim(), @"\S+").Success &&
                            currentName != string.Empty)
                {
                    seq[currentName] += line.Trim();
                }
            }
        }
    }
    catch (IOException e)
    {
        Console.WriteLine("An IO exception has benn thrown!");
        Console.WriteLine(e.ToString());
    }
    finally { }

    return seq;
}

Which part of the code is most time consuming and how to speed it up?

Thanks

Related: http://stackoverflow.com/questions/3927/what-are-some-good-net-profilers — Brian Cain, Jul 24 '12 at 03:05
Don't create a new regular expression each time. Create it once, and use the `RegexOptions.Compiled` flag for some extra performance. — Ry-, Jul 24 '12 at 03:06
try to avoid regular expression they typically take more time — COLD TOLD, Jul 24 '12 at 03:07
You're using strings and string concatenation a lot; look into using `StringBuilder` - it should help speed things up considerably in your case. You could rewrite your `Dictionary` to `Dictionary` as a start. — newfurniturey, Jul 24 '12 at 03:11

score 3 · Answer 1 · answered Jul 24 '12 at 03:08

3

I hope the compiler would do this automatically, but the first thing I notice is you're re-compiling the regular expression on every matching line:

            while ((line = sr.ReadLine()) != null)
            {
                if (line.StartsWith(">"))
                {// Match Sequence
                    re = new Regex(@"^>(\S+)");

Even better if you can remove the regular expressions completely; most languages provide a split function of some sort that often smokes regular expressions...

answered Jul 24 '12 at 03:08

sarnold

102,305
22
181
238

1

Agreed, 're' should be well-defined outside the loop. – matchdav Jul 24 '12 at 03:11
1

I did statistics on this, the best way to do it is to make them static and using the `RegexOptions.Compiled`. – Yuriy Faktorovich Jul 24 '12 at 03:22

Ry- · Accepted Answer · 2012-07-24T03:47:25.337

Cache and compile regular expressions, reorder conditionals, lessen number of trimmings, and such.

public static Dictionary<string, string> FASTAFileReadIn(string file) {
    var seq = new Dictionary<string, string>();

    Regex re = new Regex(@"^>(\S+)", RegexOptions.Compiled);
    Regex nonWhitespace = new Regex(@"\S", RegexOptions.Compiled);
    Match m;
    string currentName = string.Empty;

    try {
        foreach(string line in File.ReadLines(file)) {
            if(line[0] == '>') {
                m = re.Match(line);

                if(m.Success) {
                    if(!seq.ContainsKey(m.Groups[1].Value)) {
                        seq.Add(m.Groups[1].Value, string.Empty);
                        currentName = m.Groups[1].Value;
                    }
                }
            } else if(currentName != string.Empty) {
                if(nonWhitespace.IsMatch(line)) {
                    seq[currentName] += line.Trim();
                }
            }
        }
    } catch(IOException e) {
        Console.WriteLine("An IO exception has been thrown!");
        Console.WriteLine(e.ToString());
    }

    return seq;
}

However, that's just a naïve optimization. Reading up on the FASTA format, I wrote this:

public static Dictionary<string, string> ReadFasta(string filename) {
    var result = new Dictionary<string, string>
    var current = new StringBuilder();
    string currentKey = null;

    foreach(string line in File.ReadLines(filename)) {
        if(line[0] == '>') {
            if(currentKey != null) {
                result.Add(currentKey, current.ToString());
                current.Clear();
            }

            int i = line.IndexOf(' ', 2);

            currentKey = i > -1 ? line.Substring(1, i - 1) : line.Substring(1);
        } else if(currentKey != null) {
            current.Append(line.TrimEnd());
        }
    }

    if(currentKey != null)
        result.Add(currentKey, current.ToString());

    return result;
}

Tell me if it works; it should be much faster.

Does `string line in File.ReadAllLines()` build an entire (array? list?) from the file in one go, or does it build each `line` on-demand? — sarnold, Jul 24 '12 at 03:16
@sarnold: Sorry, you're right. I meant `ReadLines()`, which creates an `IEnumerable`. (Although if the file is only 5MB, it may be beneficial to read the whole thing to begin with...) — Ry-, Jul 24 '12 at 03:18
Yeah, five megs, it probably doesn't matter much one way or the other. But I've seen some _huge_ FASTA files before.. — sarnold, Jul 24 '12 at 03:19
What happens if currentKey (identifier of the sequence) is followed by a tab instead of a space? — Cristian Ciupitu, Jul 24 '12 at 03:39
@CristianCiupitu: It's treated as part of the identifier. From what I can glean from the Wikipedia article, a space and not a tab should always be used. Otherwise... `IndexOfAny` :) — Ry-, Jul 24 '12 at 03:41
Another tiny nitpick: `line.IndexOf(' ')` -> `line.IndexOf(' ', 1)` because you've already examined the first character. — Cristian Ciupitu, Jul 24 '12 at 03:43
@CristianCiupitu: Was just adding that :) Actually, it'll be `2` because it shouldn't be blank. I think not, at least... why is there no actual spec I can see somewhere? — Ry-, Jul 24 '12 at 03:48

score 1 · Answer 3 · edited Jul 24 '12 at 03:39

1

You can improve the reading speed substantially by using a BufferedStream:

using (FileStream fs = File.Open(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
    // Use the StreamReader
}

The Regex recompile @sarnold mentioned is probably your largest performance killer, though, if your processing time is ~5 minutes.

edited Jul 24 '12 at 03:39

Ry-

218,210
55
464
476

answered Jul 24 '12 at 03:10

Eric J.

147,927
63
340
553

Hah, when I saw your answer, my first thought was, "hey, I'll bet that's where 90% of the slowdown is coming from". – sarnold Jul 24 '12 at 03:15

score 1 · Answer 4 · answered Jul 24 '12 at 03:32

Here's how I would write it. Without more information (i.e how long the average dictionary entry is) I can't optimize the StingBuilder capacity. You could also follow Eric J.'s advice and add a BufferedStream. Ideally, you'd do away with Regular Expressions entirely if you want to crank out the performance, but they are a lot easier to write and manage, so I understand why you'd want to use them.

public static Dictionary<string, StringBuilder> FASTAFileReadIn(string file)
{
    var seq = new Dictionary<string, StringBuilder>();
    var regName = new Regex("^>(\\S+)", RegexOptions.Compiled);
    var regAppend = new Regex("\\S+", RegexOptions.Compiled);

    Match tempMatch = null;
    string currentName = string.Empty;
    try
    {
        using (StreamReader sReader = new StreamReader(file))
        {
            string line = string.Empty;
            while ((line = sReader.ReadLine()) != null)
            {
                if ((tempMatch = regName.Match(line)).Success)
                {
                    if (!seq.ContainsKey(tempMatch.Groups[1].Value))
                    {
                        currentName = tempMatch.Groups[1].Value;
                        seq.Add(currentName, new StringBuilder());
                    }
                }
                else if ((tempMatch = regAppend.Match(line)).Success && currentName != string.Empty)
                {
                    seq[currentName].Append(tempMatch.Value);
                }
            }
        }
    }
    catch (IOException e)
    {
        Console.WriteLine("An IO exception has been thrown!");
        Console.WriteLine(e.ToString());
    }

    return seq;
}

As you can see, I've slightly changed your dictionary to use the optimized StringBuilder class for appending values. I've also pre-compiled the regular expressions once and once only to ensure that you aren't redundantly recompiling the same regular expression over and over again. I've also extracted your "append" case to compile into a Regular Expression as well.

Let me know if this helps you out performance-wise.

How to speed up this code?

4 Answers4