Get contents of <body> </body> within a string

Question

I want to do the following.

$("a").click(function (event) {
    event.preventDefault();
    $.get($(this).attr("href"), function(data) {
        $("html").html(data);
    });
});

I want every hyperlink to make an Ajax call and retrieve the HTML.

Unfortunately, you cannot simply replace the current HTML with the HTML you receive in the Ajax response.

How can I grab only what is within the <body> </body> tags of the Ajax response, so that I can replace only the contents of the body in the existing HTML?

Edit: the <body> opening tag will not always be just <body>; it may sometimes have a class, e.g.

<body class="class1 class2">

Why would you like to Replace all the contents of the BODY? **PUN INTENDED** — Sujit Agarwal, Jun 01 '11 at 02:34
The best answer I've found for this is here: https://stackoverflow.com/questions/3628374/how-to-extract-body-contents-using-regexp/3642850#3642850 — nabrown, Aug 23 '22 at 21:10

Rob Raisch · Accepted Answer · 2011-06-01T19:01:46.510

11

If I understand you correctly, grab the content between the body tags with a regex.

$.get($(this).attr("href"), function(data) {
    var body=data.replace(/^.*?<body>(.*?)<\/body>.*?$/s,"$1");
    $("body").html(body);
});

EDIT

Based on your comments below, here's an update to match any body tag, irrespective of its attributes:

$.get($(this).attr("href"), function(data) {
    // [\s\S] matches any character, including newlines; "." would stop at line breaks
    var body=data.replace(/^[\s\S]*?<body[^>]*>([\s\S]*?)<\/body>[\s\S]*?$/i,"$1");
    $("body").html(body);
});

The regex is:

^               match starting at beginning of string

[\s\S]*?        ignore zero or more characters, including newlines (non-greedy)

<body[^>]*>     match literal '<body'
                    followed by zero or more chars other than '>'
                    followed by literal '>'

(               start capture

  [\s\S]*?      zero or more characters, including newlines (non-greedy)

)               end capture

<\/body>        match literal '</body>'

[\s\S]*?        ignore zero or more characters, including newlines (non-greedy)

$               to end of string

Add the 'i' switch to match upper and lowercase.

A correction regarding the 's' switch: in JavaScript, '.' does not match line breaks by default, and the 'm' switch only changes where ^ and $ match; it does not make '.' span lines. The 's' (dotAll) switch exists only in ES2018 and later engines, which is why older browsers reject it as an invalid flag. Using [\s\S] in place of '.' works everywhere. (Damn you Perl, interfering with me when I'm writing about JavaScript! :-)
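For reference, the extraction step can be pulled out into a small standalone function. This is only a sketch and not part of the original answer: `extractBody` is a hypothetical helper name, it uses `[\s\S]` rather than `.` so the match spans line breaks without relying on the `s` flag, and returning `null` when no body element is found is a design choice assumed here.

```javascript
// Hypothetical helper (the name is an assumption): returns the markup between
// <body ...> and </body>, or null when the string has no body element.
function extractBody(html) {
    // [\s\S] matches any character, including newlines; 'i' ignores case.
    var match = /<body[^>]*>([\s\S]*?)<\/body>/i.exec(html);
    return match ? match[1] : null;
}

// Example input: a multi-line page whose body tag carries a class attribute.
var page = [
    '<html>',
    '<head><title>t</title></head>',
    '<body class="class1 class2">',
    '<p>Hello</p>',
    '</body>',
    '</html>'
].join('\n');

console.log(extractBody(page));
```

Inside the `$.get` callback this could be used as `var body = extractBody(data);` followed by `$("body").html(body);` when `body` is not `null`. Like any regex approach, it stops at the first `</body>` it sees, so a page that contains that text inside a script or comment would be cut short.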

edited Jun 01 '11 at 19:01

answered Jun 01 '11 at 02:33

Rob Raisch


1

I don't think that regex works, i did a console.log on the body variable and it was still returning all of the html, not just what was within the body tags. – aprea Jun 01 '11 at 02:49
Running the following code: var page='foo<body>body</body>'; page.replace(/^.*?<body>(.*?)<\/body>.*?$/, "$1"); provides: 'body' as its answer. – Rob Raisch Jun 01 '11 at 02:51
Ahh...forgot to mention, for multi-line content, you'll need to add the 's' flag to the regex to treat the entire string as a single line. (Example has been edited.) – Rob Raisch Jun 01 '11 at 03:12
With the edited regex im getting 'SyntaxError: invalid regular expression flag s' in firebug – aprea Jun 01 '11 at 03:19
1

I forgot to mention, the `<body>` tag will not always be just `<body>`; it will sometimes have a class, e.g. `<body class="class1 class2">`. Would it be possible for you to update the regex to accommodate this? – aprea Jun 01 '11 at 04:11
1

wow, thanks for detailed reply rob, unfortunately i still cannot get it to work. if you change `$("body").html(body);` in your script to `console.log (body);` then run the script in firebug on this particular stackoverflow page and click a hyperlink somewhere, you'll see that it's still returning the whole page from `<html>` to `</html>` – aprea Jun 02 '11 at 01:07
`$.get($(this).attr("href"), function(data) { var body=data.replace(/^.*?<body[^>]*>(.*?)<\/body>.*?$/i,"$1"); console.log (body); })` – aprea Jun 02 '11 at 03:03
@RobRaisch I don't see the point of the backslash in the closing body tag. There's no need to escape the following slash, is there? – Mar 22 '13 at 22:06
If I understand your question, since the RegExp is delimited by slashes, it is critical to escape any slashes contained within it. To do otherwise would cause the JavaScript parser to croak. – Rob Raisch Mar 24 '13 at 14:57
how do I modify this to select multiple parts of the html ? suppose from the html content, I need the content inside followed by the content inside – clearScreen Jul 03 '15 at 08:18
Keep in mind the dangers of the famous [You can't parse [X]HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) answer (currently with 4426 upvotes). – Peter V. Mørch Mar 04 '17 at 11:16
1

@PeterV.Mørch Indeed, but there is a profound difference between extracting the contents of clearly delimited, non-repeating containers like `` or ``, and attempting to extract deeply-nested content. Regexen are a perfect solution for the former and, as has been noted many times, inappropriate for the latter. – Rob Raisch Mar 10 '17 at 00:08
1

@RobRaisch I upvoted your comment because it has merit. However my claim still stands: Regexes are not up for the job. A `` can contain a ` – Peter V. Mørch Mar 11 '17 at 15:11

score 1 · Answer 2 · answered Mar 04 '17 at 12:27

I didn't want to mess with regular expressions. Instead, I created a hidden <iframe>, loaded the contents in it, and extracted the <body> from the page in the <iframe> in the page's onload().

I needed to be careful with the same-origin policy for the iframe (this article showed the way):

var iframe = document.createElement('iframe');
iframe.style.display = "none";
jQuery('body').append(iframe);
iframe.contentWindow.contents = data;
iframe.onload = function () {
    var bodyHTML = jQuery(iframe).contents()
                        .find('body').html();
    // Use the bodyHTML as you see fit
    jQuery('#error').html(bodyHTML);
}
iframe.src = 'javascript:window["contents"]';

Just remove the <iframe> when you're done...

Chris Baker · Answer 3 · 2017-03-06T05:25:25.523

-1

Be sure to bind events to the document, filtered by class: $(document).on('click', '.my-class-name', doThings);. If you replace the HTML of the body, any event bindings made directly on elements, such as $('.my-class-name').on('click', doThings);, will be destroyed when the DOM is rebuilt from the new HTML. Rebinding will work, but it also leaves behind references from the old handlers and nodes for the garbage collector to clean up. In simpler terms, the page may get heavier and heavier the longer it is open.

I have not tested this on multiple platforms, use with caution.

// create a new html document
function createDocument(html) {
  var doc = document.implementation.createHTMLDocument('');
  doc.documentElement.innerHTML = html;
  return doc;
}
$("a").click(function (event) {
    event.preventDefault();
    $.get($(this).attr("href"), function(data) {
        // .html() must be called; passing the bare .html method would not read the parsed body
        $("body").html($(createDocument(data)).find('body').html());
    });
});

edited Mar 06 '17 at 05:25

answered Jun 01 '11 at 19:11

Chris Baker


I would've loved for this to work. But `jQuery('foobar').find('body').length == 0` :-( For that reason I'm downvoting. – Peter V. Mørch Mar 04 '17 at 11:10
Weirdly, `jQuery('foobar').find('span').length == 1`, but I can't extract a `` from a `` – Peter V. Mørch Mar 04 '17 at 11:18
@PeterV.Mørch I added a function there to make a new html document first. This seems to work -- can you confirm? – Chris Baker Mar 06 '17 at 05:20
