8

I have a code example where a MatchCollection seems to hang the program when trying to use it with foreach.

I am parsing css using a class CSSParser:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using Helpers.Extensions;

namespace Helpers.Utils
{
    public class CSSParser
    {
        private readonly Dictionary<string, Dictionary<string, string>>
            _dict = new Dictionary<string, Dictionary<string, string>>();

        private const string SelectorKey = "selector";
        private const string NameKey = "name";
        private const string ValueKey = "value";

        private const string GroupsPattern
            = @"(?<selector>(?:(?:[^,{]+)\s*,?\s*)+)\{(?:(?<name>[^}:]+)\s*:\s*(?<value>[^};]+);?\s*)*\}";

        private const string CommentsPattern
            = @"(?<!"")\/\*.+?\*\/(?!"")";

        private readonly Regex _pattern
            = new Regex(GroupsPattern, RegexOptions.IgnoreCase | RegexOptions.Multiline);

        public CSSParser(string cssString)
        {
            var noCommentsString = Regex.Replace(cssString, CommentsPattern, "");
            var matches = _pattern.Matches(noCommentsString);

            foreach (Match item in matches)
            {
                var selector = item.Groups[SelectorKey].Captures[0].Value.Trim();

                var selectorParts = selector.Split(',').Select(s=>s.Trim());
                foreach(var part in selectorParts)
                {
                    if (!_dict.ContainsKey(part))
                      _dict[part] = new Dictionary<string, string>();
                }

                var classNameCaptures = item.Groups[NameKey].Captures;
                var valueCaptures = item.Groups[ValueKey].Captures;

                var count = item.Groups[NameKey].Captures.Count;

                for (var i = 0; i < count; i++)
                {
                    var className = classNameCaptures[i].Value.TrimIfNotNull();
                    var value = valueCaptures[i].Value.TrimIfNotNull();

                    foreach(var part in selectorParts)
                    {
                        _dict[part][className] = value;
                    }
                }
            }
        }

        public IEnumerable<KeyValuePair<string,string>> LookupValues(string selector)
        {
            IEnumerable<KeyValuePair<string,string>> result
                = new KeyValuePair<string,string>[]{};
            if (_dict.ContainsKey(selector))
            {
                var subdict = _dict[selector];

                result = subdict.ToList();
            }

            return result;
        }

        public string LookupValue(string selector, string style)
        {
            string result = null;
            if (_dict.ContainsKey(selector))
            {
                var subdict = _dict[selector];

                if (subdict.ContainsKey(style))
                    result = subdict[style];
            }

            return result;
        }
    }
}

and it works fine with input like this:

        [TestMethod]
        public void TestParseMultipleElementNames()
        {
          const string css = @"h1, h2, h3, h4, h5, h6
{
    font-family: Georgia, 'Times New Roman', serif;
    color: #006633;
    line-height: 1.2em;
    font-weight: normal;
}
";

          var parser = new CSSParser(css);

          Assert.AreEqual("normal", parser.LookupValue("h4", "font-weight"));
        }

but when I run it with a css string containing no attributes:

        [TestMethod]
        public void TestParseNoAttributesStyle()
        {
          const string css = @"
#submenu-container
{
}
";

          var parser = new CSSParser(css);

          Assert.IsFalse(parser.LookupValues("#submenu-container").Any());
        }

the program is hanging on this line in CSSParser:

foreach (Match item in matches)

The debugger stops marking the currently executing line, the loop block itself is never reached.

Why is the MatchCollection hanging my program?

For completeness:

namespace Helpers.Extensions
{
  public static class StringExtension
  {
    public static string TrimIfNotNull(this string input)
    {
      return input != null ? input.Trim() : null;
    }
  }
}
Anders Lindén
  • 6,839
  • 11
  • 56
  • 109
  • 2
    I'm no expert with Regex but the Regex engine can get stuck or take a long time if it has to do constant lookahead and lookbehind operations. Designing your Regex to minimise these would help. Perhaps a Regex expert can help you with the specifics. +1 =] – Sean Airey Aug 02 '13 at 12:13
  • 1
    Pause the debugger and look at the stack to determine what is going on. Check "Show external code" to see everything. – usr Aug 02 '13 at 12:16
  • 1
    a) languages like this shouldn't *really* be parsed with Regex. b) why not [use something off-the-shelf](http://stackoverflow.com/q/512720/50776)? It will probably be much faster and more robust and get you ahead in your project more than doing it by hand will. – casperOne Aug 02 '13 at 12:16
  • @usr I am debugging line by line, but the debugger will not let me step one line further after the foreach line. Instead it hangs. – Anders Lindén Aug 02 '13 at 12:17
  • 3
    Now hit pause. It will show you where exactly it hangs in the .NET code. – usr Aug 02 '13 at 12:18
  • Also, is the CPU pegged on one core? In that case the regex is just inefficient. – usr Aug 02 '13 at 12:21
  • When I had hit break, the "in" word became highlighted. When continuing, the test actually became successful, the test time was 57 seconds. When I tried to run the test again without trying to interrupt it, it succeeded after 22 seconds. Meaning that all I need to do is to add efficiency to the regular expression! – Anders Lindén Aug 02 '13 at 12:22
  • Ok you found the problem. But next time also look at the stack. – usr Aug 02 '13 at 12:23
  • You may be able to make due with an inefficient regex in some cases. If you can confirm that the only time the Regex hangs it is going to anyway fail, you can add a timeout to your regex. – JoelFan May 13 '20 at 07:04

3 Answers3

1

Your Regex is just inefficient and burning CPU. You can confirm this by a) looking at the CPU time used and b) pausing the debugger repeatedly and looking at the stack (will be in the bowels of the Regex engine).

usr
  • 168,620
  • 35
  • 240
  • 369
  • Then why are only empty css blocks hard to parse? – Anders Lindén Aug 02 '13 at 12:24
  • 1
    I don't know. Fixing the regex is not part of the question. The question is "can a MatchCollection hang". Answer: no, just slow. – usr Aug 02 '13 at 14:00
  • My best guess is that an empty css block results in many failed match attempts, many backtracking. A filled css block might match quickly. – usr Aug 02 '13 at 15:19
1

As far as I can tell .net goes into an eternal loop because it tries different approaches with the regex you've got (the GroupsPattern one) - I believe it makes a mistake somewhere. I've had a look at this regex and as far as I can tell you can easily remove two of the \s*, namely those that come before the negation groups respectively [^,{]+ and [^}:]+ since they already capture spaces.

So that is, instead of:

private const string GroupsPattern = @"(?<selector>(?:(?:[^,{]+)\s*,?\s*)+)\{(?:(?<name>[^}:]+)\s*:\s*(?<value>[^};]+);?\s*)*\}";

I'd have:

private const string GroupsPattern = @"(?<selector>(?:(?:[^,{]+),?\s*)+)\{(?:(?<name>[^}:]+):\s*(?<value>[^};]+);?\s*)*\}";

Now this is regex expressions, so the chances of me having overlooked something are rather large. Also I believe this also results in some of the named capture groups to perhaps have extra spaces in them (but it seems you trim them anyway).

Hope it is usable. Although it still takes quite some time it works with the example you gave.

Ykok
  • 1,313
  • 13
  • 15
  • The regexp matcher were only slow, no eternal loops. When I had done some modifications, the cssparser could parse a 10k css file in an eye blink which is sufficient for my needs. I will consider using your regexp. – Anders Lindén Aug 02 '13 at 15:13
0

I changed the regexp from:

private const string GroupsPattern
    = @"(?<selector>(?:(?:[^,{]+)\s*,?\s*)+)\{(?:(?<name>[^}:]+)\s*:\s*(?<value>[^};]+);?\s*)*\}";

to:

private const string GroupsPattern
    = @"(?<selector>(?:(?:[^,{]+)\s*,?\s*)+)\{\s*(?:(?<name>[^}:\s]+)\s*:\s*(?<value>[^};]+);?\s*)*\}";

and the execution time went down from 22 seconds to 1 ms.

Anders Lindén
  • 6,839
  • 11
  • 56
  • 109
  • 1
    The first regexp is not the same as in the original post - there is a `\s` extra in the name group. The new regex will also yield different results than the original one. For the fun of it you can also try it with the following text (an extra space have been inserted) `h1, h2, h3, h4, h5, h6 { font-fa mily : Georgia, 'Times New Roman', serif ; color: #006633; line-height: 1.2em; font-weight : normal; }` – Ykok Aug 02 '13 at 13:10
  • Thanks, I edited the post so it shows the correct information! I will never have spaces in attribute names. – Anders Lindén Aug 02 '13 at 13:33