
I'm trying to create a script that copies user-submitted HTML into a jQuery object, manipulates it, and then gives it back to the user as plain text. I have a <textarea> that the user pastes their HTML into and then submits, at which point I grab the value of that <textarea> and create the jQuery object so that I can use jQuery to modify it.

However, I've only just recently noticed that things like the <!doctype html>, <html> tag and <body> tag don't seem to be in the object. Can these things not exist in a jQuery object? I tested this by putting a <body> tag into a jQuery object and then using .find(). I didn't get any results.

Additionally, when I use this code from "How do you convert a jQuery object into a string?":

$('<div>').append($('#item-of-interest').clone()).html();

The <body> tag is missing, although I'm not sure whether that's just because of the method I'm using to output a string from the jQuery object.

jkupczak
  • A couple of questions arise. Why would you need a `body` tag? You can really only have one `html` and one `body` tag, but you can create a new DOM or fragment etc. And why would you let the user paste in HTML at all? It's generally not a very good idea. – adeneo Aug 02 '14 at 22:16
  • This is specifically for internal use only at my company. So I'm not concerned with security. I need the body tag because the user is going to submit the full source code from an HTML page. After I'm done manipulating it I'm going to give it back to them. If they get it back without the body tag etc, it's going to be a hassle for them. – jkupczak Aug 02 '14 at 22:18
  • Trying `jQuery("<body>")` at the debug-window and that will return a body-tag, so it can handle it. But with content in the body tag `jQuery("<body>test</body>")` there is only a text node. So I assume something happens when it parses the string. – some Aug 02 '14 at 22:27
  • @some Same thing happens with `jQuery("test")` and if I use `` as well. All I get is the text node. `
    ` works though. I get the text AND the `
    `. It doesn't seem like a jQuery object will handle anything that's the `` tag or normally is supposed to exist only outside of it.
    – jkupczak Aug 02 '14 at 23:16
  • @jfriend00 I tried with a body tag in a documentFragment in Chrome, FF, IE and Opera, and it works. `var a=document.createDocumentFragment(); a.appendChild(document.createElement('body')); a.firstChild.appendChild(document.createTextNode('test')); console.log(a.firstChild.outerHTML);` – some Aug 03 '14 at 00:07
  • The problem is probably because the div-tag isn't supposed to have doctype, html, head or body tags... Tried the following on Chrome, FF, IE and Opera `var div,text; text=" Test

    test

    "; div=document.createElement('div'); div.innerHTML = text; console.log(div.innerHTML); ` and all of them return "Test

    test

    " (doctype, html, head and body stripped)
    – some Aug 03 '14 at 00:21
  • If you determine that it has an `html` tag, you could do `text=" Test

    test

    "; html=document.createElement('html'); html.innerHTML = text; console.log(html.outerHTML); ` :) You have to handle the doctype separately. You could also test for head and body and use innerHTML to populate them.
    – some Aug 03 '14 at 00:32

2 Answers


If you step through the jQuery source, the jQuery object constructor calls jQuery.parseHTML() once it determines that you've passed in an HTML string. Unless the HTML is a single tag by itself, parseHTML() then calls buildFragment() on the same string, and buildFragment() discards the <body> tag. I don't know why it does that, but that's the way it is coded to behave.

So, there's this type of code flow:

jQuery object constructor
    determine if argument is an HTML string
    call jQuery.parseHTML() on the HTML string
       if string is not a single tag by itself, 
           then call jQuery.buildFragment() on the string
           jQuery.buildFragment() seems to ignore the outer tag container

I have not been able to figure out why buildFragment() ignores the outer <body>other content here</body>, but it does.

On further study of buildFragment(), it correctly parses the outer tag as <body>, but as long as that tag isn't a tag type that needs some special treatment (such as the kinds of things that can only exist inside of tables), it completely ignores what type that outer tag was and forces it to be a <div>. That outer container is then ignored later, when the content is retrieved from the jQuery object. Again, I'm not sure why it does that, but that is what it does.


As for your particular problem, I think the conclusion is that you can't use jQuery's constructor to handle an entire HTML document. It just isn't built to do that.

Instead, you could search the HTML document you were given and extract just the part between <body> and </body>, give that to the jQuery object constructor, and do your manipulations on it. Then put the manipulated HTML back into the original document between the original <body> and </body> tags. That preserves everything you didn't want to manipulate while still using jQuery for the content inside the <body> tag.
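
The splice approach described above could be sketched roughly as follows. This is only an illustration, not code from jQuery: the name `transformBody` and its `transform` callback are hypothetical, and the regular expression assumes a single <body>...</body> pair whose opening tag contains no ">" inside attribute values and that does not also appear inside comments or scripts.

```javascript
// Hypothetical helper: applies `transform` to the markup between <body ...> and
// </body>, then splices the result back into the original document string.
function transformBody(source, transform) {
  // Group 1: everything up to and including the opening <body ...> tag.
  // Group 2: the markup inside the body.
  // Group 3: the last closing </body> tag and everything after it.
  var match = /^([\s\S]*?<body\b[^>]*>)([\s\S]*)(<\/body\s*>[\s\S]*)$/i.exec(source);
  if (!match) {
    // Assumption: with no <body> pair, the whole input is treated as body content.
    return transform(source);
  }
  return match[1] + transform(match[2]) + match[3];
}
```

In the browser, the callback is where the jQuery work would go, for example by wrapping the inner markup in a <div> as in the question's snippet, manipulating it, and returning the result of .html(). The doctype, <html> tag, <head> and the <body> tag's own attributes never pass through jQuery, so they come back unchanged.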

You should probably also be wary of <script> elements in the <body> tag as they probably aren't preserved perfectly either.

jfriend00

Since this is going to be used in an internal application, the function below might be of interest, even though it doesn't use jQuery (you can always call jQuery on the element that is returned).

It takes a string, puts it inside an HTML element, and lets the browser handle the tag soup. It returns an html element that always has a head and a body.

This isn't perfect, but it does a lot of the work. And with the little testing I have done it gives the same result in Chrome 36, Firefox 31, Opera 21 and Internet Explorer 11.

It strips the doctype and the html tag, so any attributes on the html tag are lost. But you get an html element that always has a head and a body, even if the input doesn't. When I tested, the script tags were not executed. I haven't tried audio/video tags, svg etc...

With a little bit of extra code you should be able to get the attributes on the html-element, and put the doctype in a string.

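// Parses a string of HTML by assigning it to a detached <html> element,
// letting the browser sort out the tag soup.
function mkDom(text) {
  var html = document.createElement('html');
  // The browser's parser supplies <head> and <body> if the input lacks them.
  html.innerHTML = text;
  return html;
}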

Test with complete document:

console.log(mkDom("<!doctype html><html lang='en'><head><title>Test</title><script src='test.js'></script></head><body><p>test</p><script>alert(1);</script></body></html>").outerHTML);

<html><head><title>Test</title><script src="test.js"></script></head><body><p>test</p><script>alert(1);</script></body></html>

Test with head and body:

console.log(mkDom("<head><title>Test</title><script src='test.js'></script></head><body><p>test</p><script>alert(1);</script></body>").outerHTML);

<html><head><title>Test</title><script src="test.js"></script></head><body><p>test</p><script>alert(1);</script></body></html>

Test with body:

console.log(mkDom("<body><p>test</p><script>alert(1);</script></body>").outerHTML);

<html><head></head><body><p>test</p><script>alert(1);</script></body></html>

Test with partial body:

console.log(mkDom("<p>test</p><script>alert(1);</script>").outerHTML);

<html><head></head><body><p>test</p><script>alert(1);</script></body></html>
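
As for the extra code mentioned earlier for keeping the doctype and the attributes on the html tag, one possible sketch is shown here. The helper names `extractDocumentShell` and `wrapDocument` are hypothetical, and the regular expressions assume simple, well-formed tags with no ">" inside attribute values; treat it as a starting point rather than a robust parser.

```javascript
// Hypothetical helper: saves the pieces that mkDom() drops, using plain string matching.
function extractDocumentShell(text) {
  // Doctype declaration, e.g. "<!doctype html>", matched case-insensitively.
  var doctype = /<!doctype[^>]*>/i.exec(text);
  // Opening <html ...> tag; group 1 holds the raw attribute text.
  var htmlTag = /<html\b([^>]*)>/i.exec(text);
  return {
    doctype: doctype ? doctype[0] : '',
    htmlAttributes: htmlTag ? htmlTag[1].trim() : ''
  };
}

// Hypothetical helper: rebuilds a full document string from the saved shell
// and the markup that belongs inside the <html> element.
function wrapDocument(shell, innerHtml) {
  var attrs = shell.htmlAttributes ? ' ' + shell.htmlAttributes : '';
  return shell.doctype + '<html' + attrs + '>' + innerHtml + '</html>';
}
```

In the browser the two could be combined with mkDom(): call extractDocumentShell(text) first, manipulate the element returned by mkDom(text), and then pass its innerHTML to wrapDocument() to get the full source back as a string.
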
some