What about this JS RegEx makes it fail in IE7 and IE8 but not IE9?

Question

I thought the community had helped me nail this problem with a case-insensitive RegExp, but I got it wrong. What about the following RegEx makes it fail in IE7 and IE8?

var reggy = /(\s*?)<span\b(?:.*?)(?:class=(?:'|"|.*?\s)?foobar(?:\s|\3))(?:.*?)(?:\/)?>(.+?)<\/span>(\s*?)/ig;

jsFiddle here. Only in IE7 and IE8 does it give a "did not match" result.

What is it you're trying to do? Perhaps a regular expression isn't the best solution to this. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — tvanfosson, Nov 09 '11 at 19:49
This looks like a ridiculous regular expression. There is no point in overcomplicating everything; you should just do this procedurally. It also looks like you are trying to use regex to identify HTML, which is _wrong_: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html — GAgnew, Nov 09 '11 at 19:55
I've seen that post. I'm not trying to parse an HTML document, I'm trying to pattern match a single HTML node. Do you think I should be using an HTML parser? — buley, Nov 09 '11 at 19:55
Changing from `(?:class=(?:'|"|.*?\s)?foobar` to `(class=('|"|.*?\s)?foobar` is doing the trick. Still have no clue why. — buley, Nov 09 '11 at 20:20

score 2 · Accepted Answer · answered Nov 09 '11 at 23:46

There are several problems with that regex, the worst of them being that you seem to be mixing up capturing and non-capturing groups. As Mike Samuel hinted, the third capturing group is the (\s*?) at the very end (which, like the one at the beginning, served no useful purpose). Try this regex:

/<span\b[^>]*\bclass=\s*(['"]?)forbes_entity\1[^>]*>[\s\S]*?<\/span>/ig

Here there's only the one capturing group; it captures a single-quote, a double-quote, or nothing. After the class name, the \1 matches the same thing again. (I changed the class name to match the sample text in your fiddle.)
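As a rough check of the description above, the sketch below runs the suggested regex against a few span strings. The sample strings are made up for illustration and are not taken from the fiddle; `search` is used rather than `test` so that the `g` flag's `lastIndex` state does not affect the results.

```javascript
// The regex from the answer, unchanged.
const re = /<span\b[^>]*\bclass=\s*(['"]?)forbes_entity\1[^>]*>[\s\S]*?<\/span>/ig;

// Hypothetical sample inputs (assumptions, not from the original fiddle).
const samples = [
  '<span class="forbes_entity">Acme</span>',               // double quotes
  "<span class='forbes_entity'>Acme</span>",               // single quotes
  '<span class=forbes_entity>Acme</span>',                 // no quotes: group 1 is empty
  '<SPAN id="x" CLASS="forbes_entity">multi\nline</SPAN>', // case and a newline in the body
  '<span class="forbes_entity\'>Acme</span>',              // mismatched quotes
];

// String.prototype.search ignores lastIndex, so the g flag is harmless here.
const results = samples.map((s) => s.search(re) !== -1);

console.log(results); // [ true, true, true, true, false ]
```

The last sample fails because `\1` has to repeat whatever the group captured, so an opening double quote cannot be closed by a single quote.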

It turned out I didn't need any other groups, but if I had needed them I would have used non-capturing groups ( (?:...) ) to make it easier to keep track of the capturing-group numbers. I also used [\s\S] instead of . to match the span's contents, in case it contains any newlines.
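To make the group numbering concrete, here is a small sketch that counts the capturing groups in the question's regex and in the variant buley mentions in the comments. `countGroups` is a hypothetical helper written for this illustration. It only shows how the groups are numbered; it does not show what IE7 or IE8 actually did with `\3`.

```javascript
// Hypothetical helper: appending an empty alternative makes the pattern
// match the empty string, and the result array then has one slot per group.
function countGroups(re) {
  return new RegExp(re.source + '|').exec('').length - 1;
}

// The regex from the question: groups are (\s*?), (.+?), (\s*?).
const original = /(\s*?)<span\b(?:.*?)(?:class=(?:'|"|.*?\s)?foobar(?:\s|\3))(?:.*?)(?:\/)?>(.+?)<\/span>(\s*?)/ig;

// The variant from the comments, with two groups made capturing.
const modified = /(\s*?)<span\b(?:.*?)(class=('|"|.*?\s)?foobar(?:\s|\3))(?:.*?)(?:\/)?>(.+?)<\/span>(\s*?)/ig;

console.log(countGroups(original)); // 3
console.log(countGroups(modified)); // 5
```

In the original, group 3 is the trailing `(\s*?)`, so `\3` points forward. In the variant, group 3 becomes the quote alternative `('|"|.*?\s)`, which sits before the `\3`, so the reference is an ordinary backreference.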

Thanks for the advice. I also tried some variations with fewer capturing groups, and one interesting thing I noticed was that there seemed to be a limit on the number of capturing groups. Having fewer is likely merited on its own, but if there is some limit then this is especially true. — buley, Nov 10 '11 at 01:36

score 1 · Answer 2 · answered Nov 09 '11 at 21:57

\3 looks suspicious: because the third capturing group follows it, it can never match anything but the empty string. Could IE be treating the \3 before the third capturing group as an octal escape, i.e. as equivalent to \u0003?

In older versions of IE, \s had a non-standard meaning -- it did not match \u00A0 for example.
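For comparison, the sketch below shows how a current ECMAScript engine (Node.js, for instance) handles the three cases raised here. It illustrates today's standard behaviour only; whether IE7 and IE8 differed on any of these points is the open question in this answer, and the patterns are simplified stand-ins rather than the regex from the question.

```javascript
// Forward reference: group 3 has not participated yet when \3 is reached,
// so \3 matches the empty string and the whole pattern still matches "abc".
const forwardRef = /^(a)(b)\3(c)$/.test('abc');

// With no capturing groups in a non-unicode pattern, \3 is read as a
// legacy octal escape, i.e. the character U+0003.
const octalEscape = /^\3$/.test('\u0003');

// \s matches the no-break space U+00A0 in engines that follow the spec.
const nbspIsSpace = /\s/.test('\u00A0');

console.log(forwardRef, octalEscape, nbspIsSpace); // true true true
```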

Mike Samuel

Or maybe the older IEs are treating it as an error because it's a forward reference. I think the ECMAScript standard says that it should simply succeed without consuming any characters, because the group it references has not yet participated in the match. Maybe IE wasn't following that rule before. – Alan Moore Nov 09 '11 at 23:28
@AlanMoore, I thought the spec says the initial value of a group is blank and they are reset every time a containing repetition is entered, but I guess that arrives at the same conclusion. – Mike Samuel Nov 10 '11 at 00:26
