how to extract body contents using regexp

Question

I have this code in a var.

<html>

    <head>
        .
        .
        anything
        .
        .
    </head>

    <body anything="">
        content
    </body>

</html>

or

<html>

    <head>
        .
        .
        anything
        .
        .
    </head>

    <body>
        content
    </body>

</html>

result should be

content

What @marcgg is saying is that thou shalt not parse HTML with regex. – Pekka, Sep 02 '10 at 15:04
This question gets asked on an hourly basis for some reason. Hence his frustration. – fredley, Sep 02 '10 at 15:05
The question is not about parsing HTML - it is about extracting the contents of BODY. – donohoe, Sep 02 '10 at 15:48
So I arrived here because I, too, have reached the point where I want to use a regex. Until now I did it properly, using a DOMParser. The reason: Chrome is so concerned that I might lose the namespace that it adds an xmlns attribute to EVERYTHING the moment I use innerHTML to extract the body. I don't want that, and I can't find any way to convince it otherwise :-( – izak, Feb 18 '14 at 13:58

score 25 · Answer 1 · edited Aug 13 '19 at 13:05

Note that the string-based answers supplied above should work in most cases. The one major advantage offered by a regex solution is that you can more easily provide for a case-insensitive match on the open/close body tags. If that is not a concern to you, then there's no major reason to use regex here.
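
For comparison, a string-based version might look roughly like the sketch below. The function name extractBodyByIndex is made up for illustration, and searching a lower-cased copy of the input is just one assumed way to get a case-insensitive tag search without a regex. It also assumes no attribute value on the body tag contains a ">" character.

```javascript
// Hypothetical helper: extract the body contents with plain string searches.
function extractBodyByIndex(html) {
  // Search a lower-cased copy so <BODY> and <body> are both found.
  // Assumption: lower-casing does not change the string length, which holds
  // for ASCII markup but not for every Unicode character.
  const lower = html.toLowerCase();
  const openStart = lower.indexOf("<body");
  if (openStart === -1) return null;
  // The opening tag may carry attributes, so find the ">" that ends it.
  const openEnd = lower.indexOf(">", openStart);
  const closeStart = lower.lastIndexOf("</body>");
  if (openEnd === -1 || closeStart === -1 || closeStart < openEnd) return null;
  return html.slice(openEnd + 1, closeStart);
}
```

This needs several index checks to cover what the regex states in a single pattern, which is the trade-off described above.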

And for the people who see HTML and regex together and throw a fit: since you are not actually trying to parse HTML with this, it is something you can do with regular expressions. If, for some reason, content contained </body> then it would fail, but aside from that, you have a sufficiently specific scenario that regular expressions are capable of doing what you want:

const strVal = yourStringValue; // this line can be omitted: pass your own string variable to pattern.exec below
// [^>]* skips any attributes on the opening tag; [\s\S]* matches any character, newlines included; i makes the tag match case-insensitive
const pattern = /<body[^>]*>([\s\S]*)<\/body>/i;
const array_matches = pattern.exec(strVal); // null when no <body>...</body> pair is found

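After the above executes, array_matches[1] will hold whatever came between the opening <body ...> tag and the closing </body> tag. If the string contains no such pair, array_matches is null, so check for that before indexing into it.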
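
As a rough sketch of how this could be wrapped for reuse: the function name extractBody, the sample string, and the trim() call are assumptions added here (trim() only because the question expects the bare word content). The pattern is a variant of the one above that uses [\s\S]* to match across newlines.

```javascript
// Hypothetical wrapper around a variant of the pattern from this answer.
function extractBody(html) {
  // [^>]* skips any attributes on the opening tag; [\s\S]* matches across
  // newlines; the i flag accepts <BODY> as well as <body>.
  const pattern = /<body[^>]*>([\s\S]*)<\/body>/i;
  const matches = pattern.exec(html);
  // exec returns null when the string has no <body>...</body> pair.
  return matches === null ? null : matches[1].trim();
}

const sample =
  '<html><head><title>t</title></head><body anything="">\n    content\n</body></html>';
console.log(extractBody(sample)); // prints "content"
```

Returning null for a missing body leaves it to the caller to decide whether that is an error.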

answered Sep 04 '10 at 15:25 by Jeffrey Blake; edited Aug 13 '19 at 13:05 by lesyk

This explains why regex is a bad choice for parsing HTML: http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – Doug Jul 05 '15 at 14:05
@Doug There is a huge difference between trying to parse HTML at a high level and trying to extract the specific contents of an individual tag. Especially a tag that both the question and answer indicate occurs exactly once in all source material. – Jeffrey Blake Jul 06 '15 at 14:22
This breaks if you run it on a "p" tag, for example. It will return all content included between the first <p> and the last </p> found. – kilianc Nov 30 '15 at 02:49
@kilianc Yes, as written it was intended exclusively for the `<body>` tag (though it could also be used for any tag that occurs exactly once in a correctly written HTML doc, such as `<head>`). To use it for repeating tags, you'd need to make some modifications. But that's not what the question here was asking. – Jeffrey Blake Dec 01 '15 at 03:42
@Jeffrey Blake can you share what to modify? – leechyeah Jan 15 '19 at 10:58
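
One possible modification for repeating tags, offered as a sketch and not as the answer author's own method: make the quantifier lazy so each match stops at the nearest closing tag, add the g flag, and collect every match with matchAll. The function name extractAll and the sample string are made up for illustration. Nested tags of the same name are not handled, which is where a real HTML parser becomes the better tool.

```javascript
// Hypothetical helper: collect the contents of every <tagName>...</tagName> pair.
function extractAll(html, tagName) {
  // tagName is assumed to be a plain tag name such as "p" (no regex metacharacters).
  // \b keeps "p" from matching "<pre>"; *? is lazy, so each match ends at the
  // nearest closing tag instead of the last one in the document.
  const pattern = new RegExp(`<${tagName}\\b[^>]*>([\\s\\S]*?)</${tagName}>`, "gi");
  return Array.from(html.matchAll(pattern), (m) => m[1]);
}

console.log(extractAll("<p>one</p><div><P class='x'>two</P></div>", "p"));
// prints [ 'one', 'two' ]
```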

score 1 · Answer 2 · answered Dec 23 '11 at 13:09

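// xhr is assumed to be a completed XMLHttpRequest instance (responseText is an instance property, not a property of the constructor)
var matched = xhr.responseText.match(/<body[^>]*>([\s\S]*)<\/body>/i);
if (matched) alert(matched[1]); // match() returns null when no body is found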

answered Dec 23 '11 at 13:09 by Catalin Enache

Doug · Answer 3 · 2018-11-21T22:51:12.853

I believe you can load your HTML document into the .NET HTMLDocument object and then simply read HTMLDocument.body.innerHTML.

I am sure there is an even easier way with the newer XDocument as well.

And just to echo some of the comments above: regex is not the best tool to use here, as HTML is not a regular language and there are some edge cases that are difficult to solve for.

https://en.wikipedia.org/wiki/Regular_language

Enjoy!
