4

I'm writing an application that is intended to process thousands of articles/entries with large number of spintax in the following format:

{Hello|Hi} {World|There!}, how are you?

However when I run the application with profiler I noticed the part where Regex is handled is hogging a lot of resource and my application eventually crash due to out of memory issue. Can anyone suggest a way to improve my code or a better way to parse the spintax?

public static String Spin(String text)
{
        Regex reg = new Regex(@"\{[^\{\}]*\}");
        Random rand = new Random((int)DateTime.Now.Ticks);
        while (true)
        {
            Match m = reg.Match(text);
            if (!m.Success) break;
            String[] parts = m.Value.TrimStart('{').TrimEnd('}').Split('|');
            int i = rand.Next(parts.Length);
            text = text.Substring(0, m.Index) + parts[i] + text.Substring(m.Index + m.Length);
        }
        return text;
  }
John Saunders
  • 160,644
  • 26
  • 247
  • 397
user1928346
  • 513
  • 6
  • 21
  • Is it the actual regex you're using that's slow or is it the rest of the code? You could use possessive quantifiers to speed up the process in order to avoid backtracking if that's causing you trouble. However, I don't think it will change much. – Firas Dib Dec 26 '12 at 13:53
  • I think specifically it's the regex.match part that is hogging the most resource. The rest of the code is fine. – user1928346 Dec 26 '12 at 13:58
  • I see your code does not support nested spintaxes. Is it intended? – SergeyS Dec 26 '12 at 14:06
  • No it is not intended. It is supposed to support nested spintaxes however I have no clue how to achieve that. – user1928346 Dec 26 '12 at 14:07
  • What fraction of whole method execution time (in %) regex.match takes? – SergeyS Dec 26 '12 at 14:17
  • 1
    It's unlikely that anything in your `Spin` method is causing a memory leak. The memory problem is somewhere else in your code. Also, if this is going to be called thousands of times, you should make the regex static (defined outside the method), and use the `RegexOptions.Compiled` option when creating it. – Jim Mischel Dec 26 '12 at 14:23
  • @JimMischel - Initially I agreed with you regarding `static` and `RegexOptions.Compiled`, but hm, I've implemented both and they don't seem to make much of a difference (http://ideone.com/FHYC4E). I tried using a much longer string in this version, too. Perhaps the compiler already optimizes for these factors, somehow. – Andrew Cheong Dec 26 '12 at 16:10
  • @acheong87: Define the regex outside the method: `static Regex reg = new Regex(@"\{[^\{\}]*\}", RegexOptions.Compiled);`. Don't create a new one inside the method. The idea is that you define it once and use it many times. – Jim Mischel Dec 26 '12 at 16:48
  • @JimMischel - Hm, I create it only once, in the `main` method. It's reused by through the iterations, I think. – Andrew Cheong Dec 26 '12 at 16:57
  • @acheong87: My apologies. I misread the code. Try it with 1,000,000 iterations rather than 1,000. See http://stackoverflow.com/a/6005226/56778 for a little more context. – Jim Mischel Dec 26 '12 at 19:36
  • @JimMischel - Hm. With Ideone's 15-second execution limit I could not go past 2,000. But I plotted some benchmarks for 100, 200, 500, 1000, and 2000 runs on a logarithm scale, and it's pretty much linear, _i.e._ I don't see it curving in; in fact it's curving out. I have a feeling that compiled regexes are only useful for more complicated regexes: that is, maybe a simple alternation (`|`) may already be optimized even without explicitly having the regex compiled. One day I'll do some testing at home and report back. – Andrew Cheong Dec 26 '12 at 22:15

3 Answers3

7

I have implemented my fast version (no Regex, no Split, no Substring, no Replace and other string manipulation methods). To copy string I'm using String.CopyTo which copies symbols to a plain char array.

This code fully supports nested Spintaxes (potentially unlimited depth). One restriction is maximum number of options per Spintax, currently it is 100, but can be changed to 1000 or more... Another restriction is maximum length of input string, it is 100000 now, but can be increased too.

Concerning performance - my tests showed that this code is >15 times faster than any optimised Regex solution (including Jim Mischel's one) and ~5 times faster than versions which are using Substring and other String manipulation methods. I have tested this in Release mode with Optimize Code setting in VS 2012.

    static int[] partIndices = new int[100];
    static int[] depth = new int[100];
    static char[] symbolsOfTextProcessed = new char[100000];

    public static String SpinEvenMoreFaster(String text)
    {
        int cur = SpinEvenMoreFasterInner(text, 0, text.Length, 0);
        return new String(symbolsOfTextProcessed, 0, cur);
    }

    public static int SpinEvenMoreFasterInner(String text, int start, int end, int symbolIndex)
    {
        int last = start;
        for (int i = start; i < end; i++)
        {
            if (text[i] == '{')
            {
                int k = 1;
                int j = i + 1;
                int index = 0;
                partIndices[0] = i;
                depth[0] = 1;
                for (; j < end && k > 0; j++)
                {
                    if (text[j] == '{')
                        k++;
                    else if (text[j] == '}')
                        k--;
                    else if (text[j] == '|')
                    {
                        if (k == 1)
                        {
                            partIndices[++index] = j;
                            depth[index] = 1;
                        }
                        else
                            depth[index] = k;
                    }
                }
                if (k == 0)
                {
                    partIndices[++index] = j - 1;
                    int part = rand.Next(index);
                    text.CopyTo(last, symbolsOfTextProcessed, symbolIndex, i - last);
                    symbolIndex += i - last;
                    if (depth[part] == 1)
                    {
                        text.CopyTo(partIndices[part] + 1, symbolsOfTextProcessed, symbolIndex, partIndices[part + 1] - partIndices[part] - 1);
                        symbolIndex += partIndices[part + 1] - partIndices[part] - 1;
                    }
                    else
                    {
                        symbolIndex = SpinEvenMoreFasterInner(text, partIndices[part] + 1, partIndices[part + 1], symbolIndex);
                    }
                    i = j - 1;
                    last = j;
                }
            }
        }
        text.CopyTo(last, symbolsOfTextProcessed, symbolIndex, end - last);
        return symbolIndex + end - last;
    }
SergeyS
  • 3,515
  • 18
  • 27
  • Holy, !@#$! You're right. This is way faster. [Ideone evidence here.](http://ideone.com/ItZ4y2) SpinRE: 03.760s. SpinNoRE: 00.795s. SpinSergey: 00.148s. +1. – Andrew Cheong Dec 27 '12 at 18:07
  • Thanks:) I'm curious if OP is still monitoring this page:) But problem itself was interesting! – SergeyS Dec 27 '12 at 20:52
  • I uploaded a screenshot http://screencast.com/t/0Go7HHxTi where you can see a three different spinner codes and SergeyS is the best spinner code so far. I hope your code @SergeyS, can improve more because I wrote a code that generates 1 to N level and all the spinner code barely can't go to 20th level – jaysonragasa Oct 14 '14 at 04:43
4

Here's a non-Regex alternative.

Update 2012-12-27 (see new ideone demo)

  1. Optimized OP's code to use static members instead of declaring variables within the loop, use RegexOptions.Compiled, and use Substring instead of TrimLeft and TrimRight. These optimizations cut down execution time of OP's code by nearly 33%.
  2. Updated SpinNoRE to handle arbitrarily nested spintaxes, optimized code, and added comments.
  3. Renamed Spin and SpinFaster to SpinRE and SpinNoRE respectively, for clarity.
  4. Updated test case with nested example. OP's code was much, much slower to handle nested spintaxes (understandably, since every level of nesting forces an additional regex match).

New ideone demo available; code below (comments available in demo; see link):

public static String SpinNoRE(String text)
{
    int i, j, e = -1;
    char[] curls = new char[] {'{', '}'};
    text += '~';

    do
    {
        i =  e;
        e = -1;
        while ((i = text.IndexOf('{', i+1)) != -1)
        {
            j = i;
            while ((j = text.IndexOfAny(curls, j+1)) != -1 && text[j] != '}')
            {
                if (e == -1) e = i;
                i = j;
            }
            if (j != -1)
            {
                parts = text.Substring(i+1, (j-1)-(i+1-1)).Split('|');
                text = text.Remove(i, j-(i-1)).Insert(i, parts[rand.Next(parts.Length)]);
            }
        }
    }
    while (e-- != -1);

    return text.Remove(text.Length-1);
}

Result:

Input Text:       Oh! {{I'm|You're} here!|How are you{ doing{|, {buddy|pal|guy}}|}?}
Testing SpinRE:   Oh! You're here!
Testing SpinRE:   Oh! How are you doing?
Testing SpinRE:   Oh! How are you?
Testing SpinRE:   Oh! How are you doing, buddy?
Testing SpinRE:   Oh! I'm here!
Testing SpinRE:   Oh! How are you doing, guy?
Testing SpinRE:   Oh! How are you doing?
Testing SpinRE:   Oh! I'm here!
Testing SpinRE:   Oh! I'm here!
Testing SpinRE:   Oh! How are you doing?
Testing SpinNoRE: Oh! How are you doing, buddy?
Testing SpinNoRE: Oh! You're here!
Testing SpinNoRE: Oh! How are you?
Testing SpinNoRE: Oh! How are you?
Testing SpinNoRE: Oh! You're here!
Testing SpinNoRE: Oh! I'm here!
Testing SpinNoRE: Oh! How are you doing?
Testing SpinNoRE: Oh! How are you?
Testing SpinNoRE: Oh! How are you doing, buddy?
Testing SpinNoRE: Oh! I'm here!

Time elapsed over 100,000 runs of each in alternation:

SpinRE:           03.686s
SpinNoRE:         00.921s

(It's been more than 6 years since I touched C#. Please forgive and point out any mistakes.)

Andrew Cheong
  • 29,362
  • 15
  • 90
  • 145
  • Doe not work for nested Spintaxes. Try this for example: "{{I'm|you're} here!|how are you?}" – SergeyS Dec 26 '12 at 23:24
  • @SergeyS - I was only trying to emulate OP's regex solution, which didn't support nested Spintaxes either. I was more trying to demonstrate how much faster this could be without regular expressions, not provide a full Spintax solution. – Andrew Cheong Dec 26 '12 at 23:49
  • @SergeyS - My apologies. I initially read your comment on OP's question about his code not supporting nested spintaxes, and I assumed that to be the case. I'm updating my answer. – Andrew Cheong Dec 27 '12 at 11:56
  • well done! However I have found one little bug in your solution. Try this string to repro it: "Curly {}{brackets|braces} here!". If interested - take a look at my code, I did not use Split and Substring and it is ~5 times faster than yours. – SergeyS Dec 27 '12 at 17:27
  • @SergeyS - Thanks! I won't bother to fix the bug, since it seems your solution is superior anyway. I'm primarily a C/C++ developer myself, so I was looking up equivalent functions for `strpos`, `strtok`, etc. on MSDN, haha. Unfortunately I couldn't figure out how to get that "low" in level. – Andrew Cheong Dec 27 '12 at 18:57
  • I uploaded a screenshot http://screencast.com/t/0Go7HHxTi where you can see a three different spinner codes and SergeyS is the best spinner code so far. I hope your code @SergeyS, can improve more because I wrote a code that generates 1 to N level and all the spinner code barely can't go to 20th level – jaysonragasa Oct 14 '14 at 04:51
1

I would recommend a few changes to your code. First, move the regular expression definition out of the method and use the RegexOptions.Compiled option to reduce the setup time per call. Also, move creation of the random number generator out of the heavily used method.

Also, you can eliminate much unnecessary string searching by telling the regular expression where to start matching. This is important if you end up doing many iterations of the loop. The idea is that if you've already done the replacements up to position M in the string, there's no reason to check those for matches because there won't be any.

You can eliminate the calls to TrimStart and TrimEnd by replacing the expression with:

String[] parts = m.Value.Substring(1, m.Value.Length-2).Split('|');

You already know that the string starts with { and ends with }, and doesn't have either of those two characters anywhere in the middle, so all you have to do is chop off the first and last characters. There's no reason to incur the cost of the temporary strings created by TrimStart and TrimEnd.

Another possibility would be to add a capture group to the regular expression (placing parentheses around the part you want to capture), and operating on the captured text rather than the entire matched expression.

Putting all those suggestiong together leads to:

static Regex reg = new Regex(@"\{([^\{\}]*)\}", RegexOptions.Compiled);
static Random rand = new Random();
public static String Spin(String text)
{
    int matchPos = 0;
    while (true)
    {
        Match m = reg.Match(text, matchPos);
        if (!m.Success) break;
        String[] parts = m.Groups[1].Value.Split('|');
        int i = rand.Next(parts.Length);
        text = text.Substring(0, m.Index) + parts[i] + text.Substring(m.Index + m.Length);
        matchPos = m.Index;
    }
    return text;
}

That said, this won't support nesting, and making a regex solution that does support nesting is likely to be somewhat difficult. It's also less than optimum in terms of speed because it spends a lot of time building and rebuilding the text string. With a little thought, you can optimize it a bit more, but it will never be as fast as an optimized custom parser solution like SergyS supplied.

If speed is paramount, then you'll want a custom parser. The regex version won't be as fast, but if it's fast enough it has the benefit of being smaller, and easier to understand and modify than a custom parser.

Community
  • 1
  • 1
Jim Mischel
  • 131,090
  • 20
  • 188
  • 351