innerHTML/html() won't give you an XHTML serialisation unless you've actually served the page as XHTML, under the application/xhtml+xml media type. (And that doesn't work in IE<9.) If you are serving your page as text/html, then your self-closing tags are nothing but ignored clutter to the browser when it's parsing your source into a DOM. You cannot expect the markup serialised from that DOM to come out in the same format you put in.
In fact, in some cases in IE, innerHTML won't even give you a valid HTML serialisation: it omits the quotes around attribute values in some cases where it shouldn't. In short, you cannot rely on innerHTML giving you any particular format of markup. It might re-order attributes, it might HTML-escape different characters, it might normalise attribute values, it might change whitespace. So doing string operations on the html() return value is a non-starter. All you can rely on is that you can assign the serialised markup back to the innerHTML of another element and the browser will be able to parse it.
What's your purpose in trying to retrieve XHTML? You may be able to achieve more using normal DOM-style manipulations.
ETA re comment:
Then XHTML validity is the least of your worries. It doesn't matter if the HTML isn't well-formed; you will still be able to write it back with html(). But:
What you can't reliably do with the html() string is tell which sentences are in text content and which are in attribute values. For example, <img title="Hello, this is some description. Another sentence."> is markup, and if you start putting <span>s inside the title attribute, you're obviously going to have difficulties.
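To make that concrete, here is a small sketch of what goes wrong if you run a sentence-like pattern over the markup string itself. The markup variable is a hypothetical stand-in for whatever html() returned; no DOM is involved, it is plain string handling.

```javascript
// Hypothetical stand-in for a string returned by html().
var markup = '<img title="Hello, this is some description. Another sentence.">';

// Naively wrap everything that looks like a sentence (text ending in
// ., ? or ! followed by whitespace) in a span.
var wrapped = markup.replace(/.*?[.?!]\s+?/g, function (sentence) {
    return '<span>' + sentence + '</span>';
});

console.log(wrapped);
// <span><img title="Hello, this is some description. </span>Another sentence.">
```

The pattern has no idea it is inside a tag, so the span ends up wrapped around half of the img tag and closed in the middle of the title attribute.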
This is a text-processing task, so you should do it on text nodes, not markup. This is a bit tricky, and jQuery doesn't give you any special tools to do it. But see the findText function from this answer; you could use it like this:
// Split each text node into things that look like sentences and wrap
// each in a span.
var element= $('#content')[0];
findText(element, /.*?[.?!]\s+?/g, function(node, match) {
    var wrap= document.createElement('span');
    // Cut off the text after the match, then cut the match itself out of
    // the node and move it into the span.
    node.splitText(match.index+match[0].length);
    wrap.appendChild(node.splitText(match.index));
    // Put the span back where the matched text used to be.
    node.parentNode.insertBefore(wrap, node.nextSibling);
});
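In case the linked findText isn't to hand, here is a rough sketch of what such a helper might look like. This is an assumption about its shape based on how it is called above, not the code from the linked answer: it walks the element's descendants and calls callback(node, match) for each match of a global RegExp in each text node.

```javascript
// Sketch of a findText-style helper (an assumption, not the linked answer's code).
function findText(element, pattern, callback) {
    if (!pattern.global) {
        throw new Error('findText needs a pattern with the g flag');
    }
    // Walk children last-to-first so that nodes inserted by the callback
    // do not disturb the children still to be visited.
    for (var i = element.childNodes.length - 1; i >= 0; i--) {
        var child = element.childNodes[i];
        if (child.nodeType === 1) {            // element node: recurse
            findText(child, pattern, callback);
        } else if (child.nodeType === 3) {     // text node: collect matches
            var matches = [];
            var match;
            pattern.lastIndex = 0;
            while ((match = pattern.exec(child.data)) !== null) {
                matches.push(match);
                if (match[0] === '') {
                    pattern.lastIndex++;       // don't loop forever on empty matches
                }
            }
            // Report matches last-to-first so that earlier match.index values
            // are still valid after the callback has split the node.
            for (var j = matches.length - 1; j >= 0; j--) {
                callback(child, matches[j]);
            }
        }
    }
}
```

Note that with the pattern used above, a final sentence with no whitespace after its full stop is not matched, so it is left unwrapped. You'd have to adjust the pattern if you want that one too.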