RegEx to get the last tag in HTML

Question

I am trying to write a regular expression in my Node.js application that gets the last </body> tag on a page. The issue I am running into is that some HTML pages have iframes inside them that add additional </body> tags. I've tried a bunch of different things, but I just can't get around this issue.

You could always investigate use of the `lastIndexOf` member of strings. — enhzflep, Jun 30 '14 at 00:56
Obligatory link to this answer: http://stackoverflow.com/a/1732454/2057919 — elixenide, Jun 30 '14 at 00:57
@porneL, EdCottrell: He's not trying to parse an HTML document. People need to stop linking to that Q&A unnecessarily. — cookie monster, Jun 30 '14 at 01:10
Are you saying you want all the content before the last `</body>`? I can't tell specifically what you want, but `/([\s\S]+)<\/body>/` will give you all the content in the capture group up until the last `</body>` tag in the document. You can make it case insensitive with the `i` modifier. — cookie monster, Jun 30 '14 at 01:16
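
The two suggestions in the comments above (`lastIndexOf` and a greedy capture group) can be sketched roughly as follows. This is only an illustration: the sample page and the function names `lastBodyCloseIndex` and `contentBeforeLastBodyClose` are made up for this example, and both approaches treat the page as a plain string, so they assume the last literal `</body>` really is the tag you want.

```javascript
// Hypothetical sample page: the iframe's srcdoc carries its own </body>.
const html = [
  '<html><body>',
  '<iframe srcdoc="<body>inner</body>"></iframe>',
  '<p>outer</p>',
  '</BODY></html>',
].join('\n');

// Approach 1: plain string search from the end.
// Lowercasing makes the search case-insensitive. This assumes lowercasing
// does not change the string length, which holds for ASCII markup but not
// for a few characters such as U+0130.
function lastBodyCloseIndex(source) {
  return source.toLowerCase().lastIndexOf('</body>');
}

// Approach 2: the greedy regex from the comment, with the `i` modifier.
// [\s\S]+ consumes as much as possible, so the capture group ends right
// before the last literal </body> in the string.
function contentBeforeLastBodyClose(source) {
  const match = /([\s\S]+)<\/body>/i.exec(source);
  return match ? match[1] : null;
}

const index = lastBodyCloseIndex(html);
console.log(index, html.slice(index, index + 7));
console.log(contentBeforeLastBodyClose(html) === html.slice(0, index));
```

Neither version tolerates whitespace inside the tag (for example `</body >`), and neither knows about comments, scripts, or attribute values.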

score 4 · Answer 1 · edited May 23 '17 at 10:26

You should use an HTML parser instead, e.g. https://github.com/cheeriojs/cheerio

In general HTML syntax is not regular and hence impossible to correctly match using a regular expression.

However, since there can only be one <body> in the document it may actually be possible to find just its closing tag using a regex without invoking Zalgo, because you don't need to create a full parse tree, you just need to tokenize the stream. But in HTML5 there are still some crazy tokenizer states and reparsing rules (e.g. recovery from unclosed <script>), and I'm not quite sure if they're possible to express with a regular expression.

But if you simply use an HTML parser, it will save you the hassle of dealing with fun cases such as:

<!-- </body -->
<iframe srcdoc="yup, that's valid</body>"></iframe>
<script>alert("</body> yet?");/*
</body> not this one
*/</script>
</BoDy
>
<-- ^^ it was the one above, or was it? </body>

Oh, and a valid HTML document doesn't need to have an explicit </body> at all! It's automatically implied by </html> or the end of the document.
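
To make the point concrete, here is a small sketch that runs two simple patterns over the sample above. The variable names are illustrative, and the patterns are assumptions about what a "naive" regex might look like, not something taken from the question.

```javascript
// The tricky sample from this answer, reproduced as a JavaScript string.
const sample = [
  '<!-- </body -->',
  '<iframe srcdoc="yup, that\'s valid</body>"></iframe>',
  '<script>alert("</body> yet?");/*',
  '</body> not this one',
  '*/</script>',
  '</BoDy',
  '>',
  '<-- ^^ it was the one above, or was it? </body>',
].join('\n');

// A naive pattern: the literal closing tag, case-insensitive.
const naiveHits = [...sample.matchAll(/<\/body>/gi)].map((m) => m.index);

// A slightly more tolerant pattern that allows whitespace before `>`.
const tolerantHits = [...sample.matchAll(/<\/body\s*>/gi)].map((m) => m.index);

// Position of the tag that is split across two lines in the sample.
const splitTagIndex = sample.indexOf('</BoDy');

console.log(naiveHits.length);                   // 4
console.log(naiveHits.includes(splitTagIndex));  // false
console.log(tolerantHits.length);                // 5
```

The naive pattern finds four matches (inside the attribute value, twice inside the script, and in the trailing text) and misses the tag that is split across two lines. The tolerant pattern does find that tag, but still cannot tell it apart from the other four.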

answered Jun 30 '14 at 00:55 by Kornel

Just to find one simple, known closing tag? I don't think so. – cookie monster Jun 30 '14 at 01:12
@cookiemonster Finding stuff in HTML is parsing, and regular expressions are a tool for parsing, just not for the right type of grammar. And your regex fails on my example :) – Kornel Jul 01 '14 at 02:26
Finding stuff in HTML is no different than finding stuff in any other string if the target is known to exist. And yes, we can make contrived examples, but if the document is known to have a reliable structure, then parsing an entire document would be silly. Even if it isn't known and guaranteed, I'd still not parse the entire thing just to find the end of the body. It would make more sense to make a custom parser that would work from the end and handle the most common cases. At times there could be extra cruft included, but that's no different than the contrived example you presented. – cookie monster Jul 01 '14 at 02:39
@cookiemonster syntaxes have different types of grammars and regex is proven to be unable to match some of them, e.g. grammars that allow infinite nesting: regex can only match nesting up to a fixed maximum depth. And some grammars are *way* too complex to match with a regex, e.g. try writing a regex that matches last words of only sentences containing sarcasm. – Kornel Jul 01 '14 at 02:49
You seem to be stuck in some theoretical scenario unrelated to the question. He's not parsing an entire document. He's looking for something very specific. And again, even if it's not known to exist at a reliable position, a simple custom parser would make more sense than parsing the whole document. – cookie monster Jul 01 '14 at 02:52
@cookiemonster the OP is parsing an entire document, he's just not building a DOM. However, to find a "last" thing in HTML you need to correctly skip over all previous ones (and that's called parsing, even if you throw away the data or don't think it's serious enough). And I've given real examples of valid (and not so valid, but still parseable by all browsers) HTML that fools naive regexes. – Kornel Jul 01 '14 at 02:58
There's no indication that the OP is parsing the entire document, or at least not fully parsing it as HTML. But once again you seem to be missing the point. If the document in question is known to have a reliable structure, then parsing the entire thing doesn't make sense. If not, then there's almost certainly going to be junk included that isn't desired. If there's invalid stuff after the closing `</body>`, the browser throws it into the body. So the most sensible thing to do would probably be to start from the end and have a simple parser that can make as much sense of things as possible. – cookie monster Jul 01 '14 at 03:07
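
The "start from the end" idea from this comment thread could look something like the sketch below. It is a minimal illustration under stated assumptions, not a real HTML parser: the function name `findLastBodyCloseFromEnd` is hypothetical, and the scanner makes no attempt to skip comments, scripts, or attribute values, so the objections raised in the answer still apply.

```javascript
// Walk backwards over every "</" in the string and return the first one
// (counting from the end) that starts a closing body tag.
// Returns { index, length } or null if nothing matches.
function findLastBodyCloseFromEnd(source) {
  // Sticky (y) makes exec() match only at lastIndex; \s* tolerates
  // whitespace or a line break before the closing ">".
  const closeTag = /<\/body\s*>/iy;
  let pos = source.length;
  while (pos > 0) {
    const start = source.lastIndexOf('</', pos - 1);
    if (start === -1) return null;
    closeTag.lastIndex = start;
    const match = closeTag.exec(source);
    if (match) return { index: start, length: match[0].length };
    pos = start; // keep scanning to the left of this candidate
  }
  return null;
}

// Hypothetical input with a closing tag split across two lines.
const doc = '<body>a</body ><p></p></BoDy\n>tail</div>';
const found = findLastBodyCloseFromEnd(doc);
console.log(found, doc.slice(found.index, found.index + found.length));
```

Because it only inspects candidates near the end of the string, this avoids scanning the whole document when the tag sits where it is expected.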

score 2 · Answer 2 · edited May 23 '17 at 12:20

Don't use a regex to parse html.

node.js has a number of modules that can help you with this:

answered Jun 30 '14 at 00:55 by matt

This doesn't answer the question. He's not trying to use regex to parse an HTML document. He's looking for one specific thing. – cookie monster Jun 30 '14 at 01:11
I'm 100% aware of that - that's because using a regex to parse HTML documents is generally, and rightly, considered to be a bad practice. If there were a question along the lines of "How do I insert data into this database using a non-parameterized SQL statement?" or "How can I keep cleartext passwords safe in my database?", I'd also respond with a statement of "this is a bad idea, don't do it." – matt Jun 30 '14 at 01:17
You're 100% aware of *what*? I'm saying that the idea that using a regex to parse HTML documents is bad practice has nothing to do with the question he's asking because he's ***not*** trying to parse an HTML document. – cookie monster Jun 30 '14 at 01:19
I'm aware that I didn't answer the question of how to find something in an HTML document using a regex. HTML documents represent structured data, and regex isn't the correct tool in this case. For example, body tags are *not* required to be closed - if you use a regex, how are you going to handle cases where a body tag isn't closed? Or a self-closing body tag is used? Or a closing body tag appears in a comment? Regex is powerful, and has many uses - this just isn't one of them. – matt Jun 30 '14 at 01:55
You assume *far* more than the question states. Of course body elements are not required to be closed. That doesn't mean that the particular document in question can't be relied upon to hold a closing tag. We can spend all day imagining scenarios that will break, or we can come back to earth and look at the actual situation presented in the question. If it's simply a string of text with a specific substring that needs to be matched, then that's exactly what regex is for. Parsing it out entirely would be silly. *"Don't use regex to parse html"* has nothing to do with what the question suggests. – cookie monster Jun 30 '14 at 02:57

Tiberiu C. · Answer 3 · 2014-06-30T01:34:03.967

0

Regular expressions were never meant to parse documents. Avoid using them at all costs when it comes to parsing more than a line; they are very slow.

Nevertheless, if you really insist, match every occurrence and then take the last result. As far as I know, there is no reverse search in RegEx.
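
A rough sketch of "match every occurrence and take the last result" in JavaScript might look like this. The helper name `lastMatch` and the sample string are illustrative only, and the pattern is one possible choice rather than a recommendation.

```javascript
// Run a pattern over the whole string and keep only the final match.
// Returns the last match object (with .index) or null.
function lastMatch(pattern, source) {
  // Work on a copy that is guaranteed to have the global flag, so that
  // exec() advances through the string instead of restarting at 0.
  const flags = pattern.flags.includes('g') ? pattern.flags : pattern.flags + 'g';
  const re = new RegExp(pattern.source, flags);
  let last = null;
  let match;
  while ((match = re.exec(source)) !== null) {
    // Guard against an infinite loop if the pattern can match an empty string.
    if (match[0] === '') re.lastIndex += 1;
    last = match;
  }
  return last;
}

// Hypothetical page with an extra closing tag inside an iframe attribute.
const page = '<body><iframe srcdoc="<body></body>"></iframe></body>';
const hit = lastMatch(/<\/body\s*>/i, page);
console.log(hit.index, hit[0]);
```

This visits every match from the start of the string, so it does more work than a search from the end, but it needs nothing beyond the standard `RegExp` API.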

answered Jun 30 '14 at 01:02 by Tiberiu C.

He's not parsing the document. What do you mean by reverse search? – cookie monster Jun 30 '14 at 01:12
Looking for elements in a text is parsing, right? By "reverse" I meant search from the end of the string backwards. – Tiberiu C. Jun 30 '14 at 01:23
No, looking for known substrings in text is very simply searching a string. It doesn't matter if the string is HTML or not. Parsing HTML would be taking an entire document and converting it from the HTML serialization into a fully representative object structure. That's not what the question is asking about. Can you show a demonstration of reverse search using a RegExp in JavaScript? – cookie monster Jun 30 '14 at 01:25
Oh ... a 'w' slipped in my text .. to make it work though .. as a poor man's solution, reverse the string and find the first match in reverse :)) – Tiberiu C. Jun 30 '14 at 01:33
