5

I get some HTML as an Ajax response, and I need to get just the body contents. So I made this regex:

/(<body>|<\/body>)/ig

It works well in all browsers, but for some reason IE gives me a different array when I use split:

data.split(/(<body>|<\/body>)/ig)

In all normal browsers the content of the body is in split(/(<body>|<\/body>)/ig)[2], but in IE it's in split(/(<body>|<\/body>)/ig)[1] (tested in IE7 & 8).

Why is this? And how could I modify it in order to get the same array in all browsers?

Edit, just to clarify: I already have a solution, as mentioned by tobyodavies. I want to understand why it behaves differently.

this is the HTML from the response: (the string in data)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"  xml:lang="de"  lang="de" dir="ltr">
<head>
blablabla...
</head>
<body>
<div class="iframe">
   <div id="block-menu-menu-primary-links-user" class="block-menu">
 <h3>Primary Links - User</h3>  <div class="content"><ul class="menu"><li class="leaf first"><a target="content" href="#someurl" title="">Login</a></li>
<li class="leaf last"><a target="content" href="#someurl" title="">Register</a></li>
</ul></div>
</div>
</div>
</body>
</html>

PS: I know that parsing HTML with regex is bad, but it's not my code; I just need to fix it.

meo
  • don't use regexes to parse HTML... the <center> cannot hold, it is too late! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
    – tobyodavies Apr 04 '11 at 09:13
  • I know that it's bad. It's not my code. I just need to fix it, but thank you :P I wonder why the result is different – meo Apr 04 '11 at 09:14
  • In your situation an XML parser will be more appropriate than a regex. – Stephan Apr 04 '11 at 09:16
  • Is it because IE is using a 0 based array and the rest 1? – BugFinder Apr 04 '11 at 09:16
  • @bugfinder: i never heard of 1 based arrays... – meo Apr 04 '11 at 09:18
  • Can you try to mimic the required behavior using string.indexOf() ? That should work the same on all browsers. – Elad Apr 04 '11 at 09:19
  • 1
    The following page lists differences in the 'split' implementation between browsers: http://blog.stevenlevithan.com/archives/cross-browser-split - not sure if any of the items listed there apply here. – Matthew Wilson Apr 04 '11 at 09:43
  • @meo, 1-based arrays almost used to be the norm. As a human, if you had 5 things, you'd label them 1-5, not 0-4. If you're curious (as you seem to be), why not use your JavaScript to show you each element of your array in each browser? You may see the difference. – BugFinder Apr 04 '11 at 09:43
  • @BugFinder: yeah, sure, but in real life no one talks about arrays. In JS I never saw a native array beginning with 1. That's what I wanted to say. – meo Apr 04 '11 at 09:45
  • @Matthew Wilson: Great, this is the answer i was looking for. Can you pack this in a proper Answer? – meo Apr 04 '11 at 09:46
  • @meo: added as an answer now. – Matthew Wilson Apr 04 '11 at 10:13
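
The string.indexOf() idea raised in the comments above could be sketched roughly as follows. This is only an illustration: extractBody is a hypothetical helper name, and it assumes a bare lowercase <body> tag with no attributes, as in the sample response in the question.

```javascript
// Hypothetical helper: returns the text between <body> and </body>,
// or null when either tag is missing. Assumes lowercase tags without
// attributes, as in the sample response.
function extractBody(html) {
  var openTag = "<body>";
  var closeTag = "</body>";
  var start = html.indexOf(openTag);
  var end = html.lastIndexOf(closeTag);
  if (start === -1 || end === -1 || end < start) {
    return null;
  }
  // Skip past the opening tag and stop right before the closing tag.
  return html.substring(start + openTag.length, end);
}
```

Because it uses no regular expression, this does not depend on how a browser's split() treats capture groups.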

4 Answers

9

The reason it behaves differently is the subexpression capture you create with the parentheses. Other browsers add the text matched by these captures to the resulting array; IE 8 and lower do not. To get a more consistent result, you'd have to make the group non-capturing:

/(?:<body>|<\/body>)/ig

This is the reason other browsers have the content in [2] rather than [1]: there, [1] will, in theory, contain the string "<body>". The other browsers have it right on this one, and Internet Explorer 9 fixed the problem by implementing the method as outlined by the ECMAScript 5th Edition specification.
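
To make the difference concrete, here is a small sketch run against a made-up input (the data string is a hypothetical stand-in for the Ajax response, not the real one). The results described in the comments are what a spec-compliant engine produces.

```javascript
// Hypothetical sample input standing in for the Ajax response.
var data = "<html><head></head><body><p>Hello</p></body></html>";

// Capturing group: the matched delimiters are included in the result,
// so the body content ends up at index 2.
var withCapture = data.split(/(<body>|<\/body>)/ig);

// Non-capturing group: the delimiters are dropped,
// so the body content ends up at index 1.
var withoutCapture = data.split(/(?:<body>|<\/body>)/ig);

console.log(withCapture.length, withCapture[2]);       // 5 pieces
console.log(withoutCapture.length, withoutCapture[1]); // 3 pieces
```

With the non-capturing version, the body content sits at the same index whether or not the engine adds captures to the array.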

There are more inconsistencies than this, though. ECMAScript 5 compliance in all browsers will resolve these differences, but you might want to take a look at Steven Levithan's blog, where he outlines the differing implementations and even provides a custom split() method as a solution to the problem.

Andy E
  • Can you explain more? I think this is why I am getting different results with `"this is me".split(/(\s)/);` vs `"this is me".split(/\s/);` – Muhammad Umer Aug 31 '13 at 19:43
  • @MuhammadUmer: yes, it is. I'm not sure how I could explain more than my answer does already without knowing which part you're unclear on, though. – Andy E Sep 01 '13 at 11:20
  • Why do new browsers behave badly and also include the subcapture in the result? Or is it a feature? – Muhammad Umer Sep 01 '13 at 16:29
  • @MuhammadUmer: No, as my answer states, IE is the one that got it wrong, the other browsers got it right. It was part of the specification and should have been in at least IE 6. – Andy E Sep 02 '13 at 10:54
  • I want to know why... maybe I don't know the real purpose of a capture group. Isn't it just for having something to refer to later in the regex? In my other tries I even saw the difference in match, let alone split. Now I think the problem may not be with split but with how I think capture groups work. – Muhammad Umer Sep 02 '13 at 16:01
  • Simple example: the result of `" ".match(/\s/);` is not equal to `" ".match(/(\s)/);` – Muhammad Umer Sep 02 '13 at 16:34
2

Have you considered just using xhr.responseXML.body.innerHTML? The DOM is a lot better at parsing HTML than regexes.

tobyodavies
  • +1, this is how I fixed it. But I'm still wondering why the regex behaves differently – meo Apr 04 '11 at 09:19
  • I am guessing either the string is coming from the DOM and IE is stripping the `...`... without seeing the JS as well as the HTML I can't tell – tobyodavies Apr 04 '11 at 09:21
1

The following page lists differences in the 'split' implementation between browsers: http://blog.stevenlevithan.com/archives/cross-browser-split

Matthew Wilson
0

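You can do something like this:


var body_content;
// ua was not defined in the original snippet; assume it is the lowercased user-agent string.
var ua = navigator.userAgent.toLowerCase();
// Caveat: IE 9 and later follow the spec, so this check is only right for IE 8 and lower.
var isIE = ( (ua.indexOf("msie") != -1) && (ua.indexOf("opera") == -1) && (ua.indexOf("webtv") == -1) );
var results = data.split(/(<body>|<\/body>)/ig);

if (isIE) {
  // IE 8 and lower drop the captured delimiters: [before, body, after]
  body_content = results[1];
} else {
  // Other browsers keep them: [before, "<body>", body, "</body>", after]
  body_content = results[2];
}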
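
As an alternative to sniffing the user agent, the split() behaviour itself can be tested. The following is only a sketch under that assumption; splitKeepsCaptures and bodyFromSplit are hypothetical helper names, not part of any library.

```javascript
// Returns true when split() includes captured delimiters in its result,
// as ECMAScript specifies: "a,b".split(/(,)/) then has three elements.
// Engines that drop the captures (IE 8 and lower) return two.
function splitKeepsCaptures() {
  return "a,b".split(/(,)/).length === 3;
}

// Hypothetical helper: picks the index of the body content based on the
// detected behaviour instead of the browser name.
function bodyFromSplit(data) {
  var results = data.split(/(<body>|<\/body>)/ig);
  return splitKeepsCaptures() ? results[2] : results[1];
}
```

Testing the behaviour rather than the browser name means the code keeps working in IE versions that follow the specification.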
Stephan