Regex find last body tag

Question

I know that a parser would be best suited for this, but in my current situation it has to be plain JavaScript.

I have a regex to find the closing body tag of an html doc.

var closing_body_tag = /(<\/body>)/i;

However, this fails when the source has more than one set of body tags. So I was thinking about going with something like this:

var last_closing_body_tag = /(<\/body>)$/gmi;

This works when multiple tags are found, but for some reason it fails on cases with just one set of tags.

Am I making a mistake that would cause mixed results for single tag cases?

Yes, I understand more than one body tag is incorrect, however, we have to handle all bad source.

Just curious. Why do you need to find the closing body tag? What are you going to do with that? — hindmost, Apr 24 '15 at 15:09
@adeneo The internet is a mysterious place full of people with bad decisions. It is just a use case we have to handle, correct or incorrect. — Adam, Apr 24 '15 at 15:09
@hindmost We insert a tag right before the last closing body tag. — Adam, Apr 24 '15 at 15:10
@Adam You don't need Regexp for that. Use DOM manipulation methods instead — hindmost, Apr 24 '15 at 15:11
The second expression should match only when the `</body>` tag is at the very end of the string, though I imagine you will usually have a `</html>` after that. It would be very helpful if you could give some example input to demonstrate what's succeeding and what's failing. — Jon Carter, Apr 24 '15 at 15:11
`while (m=body_re.exec(text)): match next else last tag is the last matched` — Nikos M., Apr 24 '15 at 15:12
`document.body.appendChild` inserts an element right before the closing tag. A regex does not ? — adeneo, Apr 24 '15 at 15:13
@Adam Simple example (IE9+): `var elements = document.querySelectorAll('body'); if (elements.length) elements[elements.length-1].appendChild(document.createElement('your_tag'));` — hindmost, Apr 24 '15 at 15:18
Who says more than one body tag is incorrect? There's nothing wrong with having one start tag and one end tag provided everything is in the correct order. Having two body *elements* with or without their own tags, on the other hand... — BoltClock, Apr 24 '15 at 15:19
[You can't parse (X)HTML with regex.](http://stackoverflow.com/a/1732454/1529630) — Oriol, Apr 24 '15 at 15:21
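
Jon Carter's point about the `$` anchor can be checked with a minimal sketch. The two sample strings are made up for illustration and are not the asker's actual input. With the m flag, `$` matches only at the end of a line, so the pattern misses a `</body>` that is followed by anything else on the same line. The g flag also makes `test()` stateful through `lastIndex`, which is another possible source of mixed results.

```javascript
// Hypothetical inputs, not taken from the question.
var sameLine = '<html><body></body></html>';
var ownLine = '<html>\n<body>\n</body>\n</html>';

var re = /(<\/body>)$/gmi;

// search() always starts at index 0 and ignores lastIndex.
var sameLineIndex = sameLine.search(re); // -1: "</html>" follows on the same line
var ownLineIndex = ownLine.search(re);   // 14: "</body>" ends its own line

// test() on a /g regex resumes from lastIndex, so repeated calls alternate.
var firstTest = re.test(ownLine);  // true, and lastIndex moves past the tag
var secondTest = re.test(ownLine); // false, and lastIndex resets to 0

console.log(sameLineIndex, ownLineIndex, firstTest, secondTest);
```

If the real input has the closing tag sharing a line with other markup or trailing spaces, that alone would explain the failures.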

Wiktor Stribiżew · Accepted Answer · 2015-04-24T21:33:58.923

You can use this regex:

  /<\/body>(?![\s\S]*<\/body>[\s\S]*$)/i

(?![\s\S]*<\/body>[\s\S]*$) is a lookahead that ensures there is no more closing body tag before the end of the string.

Sample code for adding a tag:

var re = /<\/body>(?![\s\S]*<\/body>[\s\S]*$)/i;
var str = '<html>\n<body>\n</body>\n</html>\n<html>\n<body>\n</body>\n</html>';
// $& re-inserts the matched closing tag, so <tag/> lands right before it
var subst = '<tag/>$&';
var result = str.replace(re, subst);
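
As a rough sketch of how this regex could be wrapped for the asker's use case (inserting a tag before the last closing body tag), something like the following might do. The helper name `insertBeforeLastBodyClose` and the sample inputs are assumptions made for illustration only.

```javascript
// Hypothetical helper: inserts `tag` just before the last </body> in `html`.
function insertBeforeLastBodyClose(html, tag) {
  var re = /<\/body>(?![\s\S]*<\/body>[\s\S]*$)/i;
  // A function replacement keeps any "$" inside `tag` from being
  // interpreted as a replacement pattern.
  return html.replace(re, function (match) {
    return tag + match;
  });
}

var oneTag = insertBeforeLastBodyClose('<html><body><p>x</p></body></html>', '<tag/>');
var twoTags = insertBeforeLastBodyClose('<body>a</body>\n<BODY>b</BODY>', '<tag/>');
var noTag = insertBeforeLastBodyClose('<p>no body</p>', '<tag/>');

console.log(oneTag);
console.log(twoTags);
console.log(noTag);
```

Input without any closing body tag comes back unchanged, since `replace` leaves the string alone when nothing matches.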

edited Apr 24 '15 at 21:33

answered Apr 24 '15 at 15:20

Wiktor Stribiżew


Maybe it's just my parser, but I get an error saying that lookaheads have to be zero-width. – Nic Apr 24 '15 at 15:21
When trying to use this in Javascript I am getting an error. Invalid regular expression: /(?i)<\/body>(?![\s\S]*<\/body>[\s\S]*$)/: Invalid group – Adam Apr 24 '15 at 15:32
Please check with my update. Inline option was a problem I guess – Wiktor Stribiżew Apr 24 '15 at 15:37

Downgoat · Answer 2 · 2015-04-24T16:17:55.657

RegExp

As I suggested in the comments, use:

/^[\S\s]+(<\/body>)/i

How

This matches all text (greedily) up to the last </body>; the i flag makes the match case-insensitive. It works no matter how many body tags you have:

</body>
</BODY>
</BoDY>
</body><!--This one's selected-->

You said you were using JavaScript, where it can be used as:

yourString.match(/^[\S\s]+(<\/body>)/i)[1];

.match returns the capture groups as long as the regex doesn't have the g flag. To further explain this RegExp:

Explanation

^ matches at the beginning of the whole string, because we don't have the m flag

[\S\s]+ greedily matches every character, newlines included, up to the last occurrence of what follows. The + can be replaced by a *

(<\/body>) captures the closing body tag that comes after all of that (the last one) as group 1

i makes the match case-insensitive (remove it if you want a case-sensitive match)
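
To tie the explanation together, here is a small sketch of how the captured group could be used to insert a tag before the last closing body tag. The helper name `insertBeforeLastClosingBody` and the sample string are made up for illustration.

```javascript
// Hypothetical helper built on the greedy pattern above.
function insertBeforeLastClosingBody(source, tag) {
  var m = source.match(/^[\S\s]+(<\/body>)/i);
  if (!m) {
    return source; // no closing body tag found: leave the input untouched
  }
  // The match starts at index 0, so the captured tag begins here:
  var insertAt = m[0].length - m[1].length;
  return source.slice(0, insertAt) + tag + source.slice(insertAt);
}

var sample = '<body>one</body>\n<BODY>two</BODY>\n</html>';
var patched = insertBeforeLastClosingBody(sample, '<tag/>');
console.log(patched);
```

The null check matters because .match returns null when there is no closing tag at all, and indexing into null would throw.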

JavaScript appendChild

If you have multiple body tags, you can still add an element before the last closing one.

var elem = document.createElement('div');
elem.setAttribute('id', 'mydiv');
elem.innerHTML = 'Foo';

Now, elem can be added in multiple ways:

1:

window.document.body.appendChild(elem);

2:

var body_elems = document.getElementsByTagName('body');
body_elems[body_elems.length - 1].appendChild(elem);

score 0 · Answer 3 · edited Apr 24 '15 at 15:20

Use

/(.|[\r\n])*(<\/body>)/mi

as the regexp. The closing tag ends up in capture group $2.

This relies on greedy matching. Note that the 'any char' symbol (.) does not match newlines or carriage returns, which therefore have to be listed explicitly. The m flag only changes how ^ and $ behave, so it is not what makes this pattern work.

edited Apr 24 '15 at 15:20

Ram


answered Apr 24 '15 at 15:14

collapsar


For the record, if you don't want to have two capture groups, you can insert `?:` just after the first `(` to make it a non-capturing group. – Nic Apr 24 '15 at 15:24
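
A minimal sketch of the difference Nic describes, using a made-up input string. The m flag is left out here because it only affects ^ and $, neither of which appears in the pattern.

```javascript
// Hypothetical input with two closing tags in different letter case.
var twoBodies = '<body>\nfirst\n</body>\n<body>\nsecond\n</BODY>\n';

// Original form: the repeated group is capture 1, the tag is capture 2.
var capturing = twoBodies.match(/(.|[\r\n])*(<\/body>)/i);

// With ?: the first group no longer captures, so the tag becomes capture 1.
var nonCapturing = twoBodies.match(/(?:.|[\r\n])*(<\/body>)/i);

console.log(capturing[2], nonCapturing[1]);
```

Both variants find the same last tag; only the group numbering changes.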

Nic · Answer 4 · 2015-04-24T15:20:39.070

The regex to match the last body tag is fairly simple:

/[\s\S]*(<\/body>)/i

What this does is match as many characters as possible (more specifically, any whitespace or anything that's not whitespace) before </body>.

The i flag means that it'll match any case for </body>, so anything like:

</body>
</BODY>
</BodY>

will all match.

I used [\s\S] instead of . because . matches everything except line terminators, which probably isn't what you want. \s matches all whitespace (spaces, tabs, every kind of newline) and \S is equivalent to [^\s], so it matches everything that isn't whitespace. Together, these match every possible character. I'd imagine a similar thing is possible with [\w\W], [\d\D], etc., but [\s\S] is my preference.
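
To make the difference concrete, here is a small sketch with a made-up two-line input. The helper `capturedTagStart` is hypothetical; it just works out where the captured tag begins.

```javascript
// Index at which the captured closing tag starts within the input.
function capturedTagStart(m) {
  return m.index + m[0].length - m[1].length;
}

// Hypothetical input: two closing tags on separate lines.
var twoLines = '<body>a</body>\n<body>b</body>';

// "." cannot cross the newline, so the match stays on the first line.
var withDot = twoLines.match(/.*(<\/body>)/i);

// [\s\S] crosses the newline, so greedy matching reaches the last tag.
var withClass = twoLines.match(/[\s\S]*(<\/body>)/i);

var dotStart = capturedTagStart(withDot);     // 7: the first closing tag
var classStart = capturedTagStart(withClass); // 22: the last closing tag

console.log(dotStart, classStart);
```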
