
I know a parser would be best suited for this, but in my current situation it has to be plain JavaScript.

I have a regex to find the closing body tag of an HTML document.

var closing_body_tag = /(<\/body>)/i;

However, this fails when the source has more than one set of body tags. So I was thinking about going with something like this:

var last_closing_body_tag = /(<\/body>)$/gmi;

This works when multiple tags are found, but for some reason it fails on cases with just one set of tags.

Am I making a mistake that would cause mixed results for single tag cases?

Yes, I understand that more than one body tag is incorrect; however, we have to handle all bad source.

Adam
  • And why would you have more than one body tag? – adeneo Apr 24 '15 at 15:08
  • Just curious. Why do you need to find the closing body tag? What are you going to do with that? – hindmost Apr 24 '15 at 15:09
  • You don't need jQuery for parsing HTML. – Ram Apr 24 '15 at 15:09
  • @adeneo The internet is a mysterious place full of people with bad decisions. It is just a use case we have to handle, correct or incorrect. – Adam Apr 24 '15 at 15:09
  • @hindmost We insert a tag right before the last closing body tag. – Adam Apr 24 '15 at 15:10
  • `/^[\S\s]+(<\/body>)/i` does that work? – Downgoat Apr 24 '15 at 15:10
  • @Adam You don't need a regexp for that. Use DOM manipulation methods instead. – hindmost Apr 24 '15 at 15:11
  • The second expression should match only when the `</body>` tag is at the very end of the string, though I imagine you will usually have a `</html>` after that. It would be very helpful if you could give some example input to demonstrate what's succeeding and what's failing. – Jon Carter Apr 24 '15 at 15:11
  • `while (m=body_re.exec(text)): match next else last tag is the last matched` – Nikos M. Apr 24 '15 at 15:12
  • @hindmost Can you give an example? – Adam Apr 24 '15 at 15:12
  • `document.body.appendChild` inserts an element right before the closing tag. A regex does not? – adeneo Apr 24 '15 at 15:13
  • More than 1 body tag??? – Yellen Apr 24 '15 at 15:15
  • @Adam Simple example (IE9+): `var elements = document.querySelectorAll('body'); if (elements.length) elements[elements.length-1].appendChild(document.createElement('your_tag'));` – hindmost Apr 24 '15 at 15:18
  • Who says more than one body tag is incorrect? There's nothing wrong with having one start tag and one end tag provided everything is in the correct order. Having two body *elements* with or without their own tags, on the other hand... – BoltClock Apr 24 '15 at 15:19
  • [You can't parse (X)HTML with regex.](http://stackoverflow.com/a/1732454/1529630) – Oriol Apr 24 '15 at 15:21

4 Answers

2

You can use this regex:

  /<\/body>(?![\s\S]*<\/body>[\s\S]*$)/i

(?![\s\S]*<\/body>[\s\S]*$) is a negative lookahead that ensures there are no more closing body tags between the match and the end of the string.


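Sample code for adding a tag before the last closing body tag:

var re = /<\/body>(?![\s\S]*<\/body>[\s\S]*$)/i;
var str = '<html>\n<body>\n</body>\n</html>\n<html>\n<body>\n</body>\n</html>';
// $& re-inserts the matched closing tag, so the new tag lands before it
// instead of replacing it.
var subst = '<tag/>$&';
var result = str.replace(re, subst);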
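When the inserted markup is not a fixed string, a replacer function may be a safer variant, since it sidesteps the special meaning of `$` sequences in replacement strings and keeps the matched tag exactly as it was written. The following is only a sketch; the helper name `insertBeforeLastBodyClose` and the sample input are assumptions made for illustration.

```javascript
// Hypothetical helper (not part of the answer above): inserts `snippet`
// immediately before the last closing body tag found in `html`.
function insertBeforeLastBodyClose(html, snippet) {
  // Same pattern as above: a closing body tag that is not followed by
  // another closing body tag anywhere later in the string.
  var re = /<\/body>(?![\s\S]*<\/body>[\s\S]*$)/i;
  // The replacer receives the matched tag and returns snippet + tag,
  // so the original casing of the tag is preserved.
  return html.replace(re, function (tag) {
    return snippet + tag;
  });
}

var str = '<html>\n<body>\n</body>\n</html>\n<html>\n<body>\n</BODY>\n</html>';
var result = insertBeforeLastBodyClose(str, '<tag/>');
// Only the final closing tag is affected; input without any closing
// body tag is returned unchanged.
```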
Wiktor Stribiżew
  • Maybe it's just my parser, but I get an error saying that lookaheads have to be zero-width. – Nic Apr 24 '15 at 15:21
  • When trying to use this in Javascript I am getting an error. Invalid regular expression: /(?i)<\/body>(?![\s\S]*<\/body>[\s\S]*$)/: Invalid group – Adam Apr 24 '15 at 15:32
  • Please check with my update. Inline option was a problem I guess – Wiktor Stribiżew Apr 24 '15 at 15:37
1

RegExp

As I suggested in the comments, use:

/^[\S\s]+(<\/body>)/i

How

This greedily matches all text up to the last </body>; the i flag makes the match case-insensitive. It works no matter how many body tags you have:

</body>
</BODY>
</BoDY>
</body><!--This one's selected-->

You said you were using JavaScript, where it can be used as:

yourString.match(/^[\S\s]+(<\/body>)/i)[1];

.match returns the capture groups only when the regex does not have the g flag, so it works here. To further explain this RegExp:

Explanation

^ matches at the beginning of the whole string, because we don't have the m flag

[\S\s]+ greedily matches any characters, including newlines, up to the last occurrence of what follows. The + can be replaced by a *

(<\/body>) matches the closing body tag that follows (the last one) and captures it as group 1

i makes the match case-insensitive (remove it if you want the match to be case-sensitive)
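Because the pattern is anchored at the start and ends with the captured tag, the position of the last closing tag can be derived from the match lengths. The sketch below shows one way to use that for an insertion; the helper names `indexOfLastBodyClose` and `insertAt` and the sample string are hypothetical.

```javascript
// Hypothetical helper: returns the index of the last closing body tag,
// or -1 when the pattern does not match.
function indexOfLastBodyClose(html) {
  var m = html.match(/^[\S\s]+(<\/body>)/i);
  if (m === null) {
    return -1;
  }
  // The full match starts at index 0 and ends with the captured tag,
  // so the tag begins at (full match length - tag length).
  return m[0].length - m[1].length;
}

// Hypothetical helper: inserts `snippet` into `html` at `index`.
function insertAt(html, index, snippet) {
  return html.slice(0, index) + snippet + html.slice(index);
}

var source = '<body>one</body>\n<body>two</BoDY>\n';
var at = indexOfLastBodyClose(source);
var patched = at === -1 ? source : insertAt(source, at, '<tag/>');
```

Note that with `+` the pattern needs at least one character before the tag, so a string consisting only of `</body>` yields -1 here; switching to `*` would cover that edge case.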

JavaScript appendChild

If the document is already loaded in the browser, you can add an element before the last closing body tag through the DOM, even if the source had multiple body tags.

var elem = document.createElement('div');
elem.setAttribute('id', 'mydiv');
elem.innerHTML = 'Foo';

Now, elem can be added in multiple ways:

1:

window.document.body.appendChild(elem);

2:

var body_elems = document.getElementsByTagName('body');
body_elems[body_elems.length - 1].appendChild(elem);
Downgoat
0

Use

/(.|[\r\n])*(<\/body>)/mi

as a regexp. The closing tag is in capture group 2 ($2).

This exploits greedy matching. Note that the 'any char' symbol (.) does not match newlines or carriage returns, which therefore need to be matched explicitly. The m flag only changes how ^ and $ behave, so it has no effect on this pattern and can be dropped.
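A small comparison may make the newline point concrete. This is an illustrative sketch with made-up input: the first pattern relies on `.` alone, the second is the regexp from this answer.

```javascript
// Illustrative input: two sets of body tags on separate lines.
var html = '<body>\n</body>\n<body>\n</BODY>';

// "." stops at line breaks, so a match cannot span lines; the engine
// settles for the first line that contains a closing tag.
var dotOnly = /.*(<\/body>)/i.exec(html);

// Allowing \r and \n explicitly lets the greedy group run to the end of
// the input and then backtrack to the last closing tag.
var withNewlines = /(.|[\r\n])*(<\/body>)/mi.exec(html);

var firstTag = dotOnly[1];     // tag found by the dot-only pattern
var lastTag = withNewlines[2]; // tag found by the answer's pattern
```

With this input, the dot-only pattern finds the lowercase tag on the second line, while the answer's pattern reaches the uppercase tag at the end.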

Ram
collapsar
  • For the record, if you don't want to have two capture groups, you can insert `?:` just after the first `(` to make it a non-capturing group. – Nic Apr 24 '15 at 15:24
0

The regex to match the last body tag is fairly simple:

/[\s\S]*(<\/body>)/i

What this does is match as many characters as possible (more specifically, any whitespace or anything that's not whitespace) before </body>. The / in the tag has to be escaped inside a regex literal.

The i flag means that it'll match any case for </body>, so anything like:

</body>
</BODY>
</BodY>

Will all match.

I used [\s\S] instead of . because . matches everything but newline characters, which probably isn't what you want. \s matches all whitespace -- spaces, tabs, every kind of newline -- and \S is equivalent to [^\s], so it matches everything that isn't whitespace. Together, these match every possible character. A similar thing is possible with [\w\W], [\d\D], etc., but [\s\S] is my preference.
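To check that the complementary pairs really behave the same, here is a minimal sketch that builds the pattern from a given any-character class. The helper name `lastBodyClose` and the sample input are assumptions for illustration only.

```javascript
// Hypothetical helper: builds the "last closing body tag" pattern from
// the given any-character class and returns the captured tag, or null.
function lastBodyClose(html, anyCharClass) {
  // In a RegExp constructor string the "/" needs no escaping.
  var re = new RegExp(anyCharClass + '*(</body>)', 'i');
  var m = re.exec(html);
  return m === null ? null : m[1];
}

var sample = '<body>\n</body>\n<body>\n</BODY>\n</html>';
var viaSpace = lastBodyClose(sample, '[\\s\\S]');
var viaWord = lastBodyClose(sample, '[\\w\\W]');
var viaDigit = lastBodyClose(sample, '[\\d\\D]');
// All three classes span line breaks, so each call reaches the final
// closing tag in `sample`.
```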

Nic