I've been investigating an issue that only seems to get worse the more I dig.

I started innocently enough trying to use this expression to split a string on HTML 'br' tags:

T = captions.innerHTML.split(/<br.*?>/g);

This works in every browser (FF, Safari, Chrome), except IE7 and IE8 with example input text like this:

is invariably subjective. <br /> 
The less frequently used warnings (Probably/Possibly) <br /> 

Please note that in the example text each tag contains a space before the '/' and is followed by a newline.

Both of the following will match all HTML tags in every browser:

T = captions.innerHTML.split(/<.*?>/g);
T = captions.innerHTML.split(/<.+?>/g);

However, surprisingly (to me at least), this does not work in FF and Chrome:

T = captions.innerHTML.split(/<br.+?>/g);

Edit:

This (suggested several times in the responses below) does not work on IE 7 or 8:

T = captions.innerHTML.split(/<br[^>]*>/g);

(It did work on Chrome and FF.)

My question is: does anyone know an expression that works in all current browsers to match the 'br' tags above (but not other HTML tags)? And can anyone confirm that the /<br.+?>/ example above should be a valid match, since two characters are present in the example text before the '>'?

PS - my doctype is HTML transitional.

Edit:

I think I have evidence this is specific to the string.split() behavior on IE, and not regex in general. You have to use split() to see this issue. I have also found a test matrix that shows a failure rate of about 30% for split() test cases when I ran it on IE. The same tests passed 100% on FF and Chrome:

http://stevenlevithan.com/demo/split.cfm

So far, I have still not found a solution for IE, and the library provided by the author of that test matrix did not fix this case.

Walt Jones

7 Answers

Your code is not working because IE parses the HTML and returns the tags in uppercase when you read them back through innerHTML. For example, if you have HTML like this:

<div id='box'>
Hello<br>
World
</div>

And then you use this JavaScript (in IE):

alert(document.getElementById('box').innerHTML);

You will get an alert box with this:

Hello<BR>World

Notice that the <BR> is now uppercase. To fix this, add the i flag alongside the g flag to make the regex case-insensitive, and it will work as you expect.
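
To see the effect of the i flag without IE at hand, here is a minimal sketch that runs in any JavaScript engine. The string ieStyleHtml is a made-up stand-in for what IE 7/8 might hand back from innerHTML (uppercase tag names); it is not captured output.

```javascript
// Made-up sample imitating IE 7/8 innerHTML output: tag names are uppercase.
var ieStyleHtml = 'is invariably subjective. <BR>' +
                  'The less frequently used warnings (Probably/Possibly) <BR>';

// Without the i flag, the lowercase "br" in the pattern never matches "<BR>",
// so split() returns the whole string as a single piece.
var caseSensitive = ieStyleHtml.split(/<br[^>]*>/g);

// With the i flag, both tags are recognized as delimiters.
var caseInsensitive = ieStyleHtml.split(/<br[^>]*>/gi);

// In a standards-compliant engine: 1 piece vs. 3 pieces (the last one is
// the empty string after the final tag).
var pieceCounts = [caseSensitive.length, caseInsensitive.length];
```

The g flag makes no difference to split(); only the i flag changes the result here.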

Paolo Bergantino

Try this one:

/<br[^>]*>/gi
Chad Birch
  • I'd advise /gi since you never know how someone will case their tags – Yevgeny Simkin May 04 '09 at 22:58
  • This works in Chrome and FF, and fails in IE. I'm giving +1 because it *should* work. – Walt Jones May 04 '09 at 23:00
  • Btw, I now realize it does NOT fail when used exactly as you provided here. I omitted the 'i' flag because I was working with a known lower-case source. Lesson learned: IE up-cases tags in innerHTML. – Walt Jones May 05 '09 at 02:00

Instead of

/<br.*?>/

you could try

/<br[^>]*>/

i.e. matching "<br", followed by any characters other than '>', followed by '>'.
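
One practical difference between the two patterns, shown as a small sketch: in JavaScript the dot does not match a newline, while the negated class [^>] does. The input string below is invented for illustration (a tag with a line break inside it).

```javascript
// Made-up input: the <br /> tag contains a newline before the slash.
var multiLineTag = 'one <br\n/> two';

// "." cannot cross the newline, so the lazy-dot pattern finds no match.
var lazyDotMatches = /<br.*?>/.test(multiLineTag);

// [^>] accepts any character except ">", including the newline.
var negatedClassMatches = /<br[^>]*>/.test(multiLineTag);

// Splitting with the negated-class pattern yields the text on either side.
var parts = multiLineTag.split(/<br[^>]*>/);
```

For tags written on a single line, both patterns match the same text.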

hlovdal

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.

In particular you may be interested in the JavaScript+DOM answer.

Chas. Owens
  • Yep, I'm not intending to do a full HTML parser, and this is not a jQuery environment. Please note, there is not a problem with regex handling this, but a browser compat issue in IE 7 and 8. (Although the example that failed in FF also puzzles me.) – Walt Jones May 04 '09 at 23:03
  • "Regexes are fundamentally bad at parsing HTML" -- not if you know what the input is going to look like. – nickf May 05 '09 at 00:02
  • @Walt Gordon Jones It isn't a matter of what you intend to do or not: regexes can't handle HTML; it isn't what they are good at. At least take a look at doing it with a parser; you can always use the DOM. – Chas. Owens May 05 '09 at 00:51
  • @nickf And is the input going to stay the same? Using a parser saves you time in the long run as regexes are extremely fragile when parsing HTML (if they even work in the first place). – Chas. Owens May 05 '09 at 00:52
  • Guys, I totally agree. You are making an excellent point, but in this specific case, I just needed to create an array using 'br' tags as delimiters. I don't think there's a DOM method for that, is there? – Walt Jones May 06 '09 at 00:07
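
On the question in the last comment: there is no single built-in DOM method for this as far as I know, but walking childNodes is one option. The following is a minimal sketch, assuming the 'br' tags are direct children of the element. The name splitOnBr is hypothetical and does not come from this thread.

```javascript
// Hypothetical helper (not from the thread): collect the text of an element's
// direct children into an array, starting a new entry at every <br> element.
function splitOnBr(element) {
  var parts = [''];
  var nodes = element.childNodes;
  for (var i = 0; i < nodes.length; i++) {
    var node = nodes[i];
    if (node.nodeType === 1) {
      // Element node. Compare the tag name case-insensitively, since
      // HTML documents report it in uppercase ("BR").
      if (node.nodeName.toLowerCase() === 'br') {
        parts.push('');
      } else {
        // textContent is the standard property; innerText is a fallback
        // for older IE versions that lack it.
        var text = node.textContent;
        if (text === undefined) {
          text = node.innerText || '';
        }
        parts[parts.length - 1] += text;
      }
    } else if (node.nodeType === 3) {
      // Text node: nodeValue holds its characters.
      parts[parts.length - 1] += node.nodeValue;
    }
    // Other node types (comments etc.) are skipped.
  }
  return parts;
}
```

With the captions element from the question, the call would be T = splitOnBr(captions). Because it reads node names rather than serialized markup, the casing of tags in innerHTML does not matter.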

Well, unfortunately I don't have a wide variety of browsers at work (just IE - sigh) but right off the bat I can see a way to optimize your regex:

T = captions.innerHTML.split(/<br[^>]*?>/g);

The inline character class definition [^>] instructs the expression to match any character EXCEPT the greater-than sign. You may also want to make it case insensitive (pass gi at the end not just g).

Goyuix
  • In some regular expression engines, the *? operator indicates non-greedy matching, where /.*?>/ will match any character up to the *first* point where the following text matches. Without the ?, /.*>/ matches up to the *last* point where the following text matches. – Greg Hewgill May 04 '09 at 23:15
  • Yes, I want the first match (obviously), but the [^>] looks like a clever way to force the first match, since that's the only way to satisfy the condition. Regardless, even the variations that should be greedy do not match at all under IE. – Walt Jones May 04 '09 at 23:19
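
The greedy versus non-greedy behavior discussed in these comments can be seen on a single line that contains two tags. A small sketch with an invented input string:

```javascript
// Made-up input with two tags on one line.
var line = 'a <br> b <br /> c';

// Greedy: ".*" runs as far as it can, so the match ends at the LAST ">".
var greedy = line.match(/<br.*>/)[0];

// Non-greedy: ".*?" stops at the FIRST ">" that completes the pattern.
var lazy = line.match(/<br.*?>/)[0];

// Negated class: "[^>]*" cannot step over a ">", so it also stops at the first one.
var negated = line.match(/<br[^>]*>/)[0];
```

Here greedy is '<br> b <br />', while lazy and negated are both '<br>'.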

Tested in Firefox 3 & IE7:

/<br.*?>/gi

Try it yourself here: http://jsbin.com/ofoke

var input = "one <br/>\n"
          + "two <br />\n"
          + "three <br>\n";

alert(input.replace(/<br.*?>/gi, ''));
nickf
  • I believe I have determined the issue is specifically with String.split on IE. (Your example uses String replace.) Look at this test case matrix for split(): http://stevenlevithan.com/demo/split.cfm IE fails about 30% of the cases. FF and Chrome pass this matrix 100%. – Walt Jones May 05 '09 at 00:16
  • Could you then try something like a replace using a regex, to replace <br> tags with "||BR||", and then use a normal non-regex split? input.replace(/<br.*?>/gi, '||BR||').split("||BR||"); Does that work? – nickf May 05 '09 at 07:26

<\s*br\s*/?\s*>

matches

<br>, <br />, < br >,<br / >

I tested this in IE 6. If the match is OK, the JS should be able to split on that regexp.
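
A quick way to check a whitespace-tolerant pattern against the four forms listed above. This sketch assumes the intended pattern allows optional whitespace (\s*) at each position. In a JavaScript regex literal the slash has to be escaped, and the i flag is added here to cover uppercase tags as well.

```javascript
// Whitespace-tolerant 'br' pattern; "\/" is the escaped slash inside a literal.
var brPattern = /<\s*br\s*\/?\s*>/i;

// The four forms listed in the answer.
var samples = ['<br>', '<br />', '< br >', '<br / >'];

// Record whether each form matches.
var results = [];
for (var i = 0; i < samples.length; i++) {
  results.push(brPattern.test(samples[i]));
}

// The same pattern used as a split() separator on a made-up string.
var chunks = 'a< br >b<br / >c<BR>d'.split(brPattern);
```

All four entries in results are true, and chunks is ['a', 'b', 'c', 'd'].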

unigogo