How to convert HTML to valid XHTML?

Question

I have a string of HTML. In this example it looks like this:

<img src="somepic.jpg" someAtrib="1" >

I am trying to work out a piece of regex that will match the 'img' node and add a slash to the end of the node, so that it looks like this:

<img src="somepic.jpg" someAtrib="1" />

Essentially, the end goal here is to ensure that the node is closed; open nodes are valid in HTML but obviously not in XML. Are there any regex buffs out there able to help?

You should not [parse (X)HTML with regex](http://stackoverflow.com/a/1732454/451590). HTML is not regular, and as such is a bad candidate for regular expressions. Use a full-fledged HTML parser. — David B, Aug 23 '12 at 13:22
@DavidB I understand what you are saying. However, I am attempting to manipulate a 'string'; this is why I am asking the question :) — John, Aug 23 '12 at 13:28
The original tag is not valid, and neither is the requested XHTML tag. Do you actually mean “well-formed” and not “valid”? — Jukka K. Korpela, Aug 23 '12 at 16:18

score 18 · Accepted Answer · edited Jan 19 '20 at 09:30

Don't use a regular expression; use a dedicated parser. In JavaScript, create a document using the DOMParser, then serialize it using the XMLSerializer:

var doc = new DOMParser().parseFromString('<img src="foo">', 'text/html');
var result = new XMLSerializer().serializeToString(doc);
// result:
// <html xmlns="http://www.w3.org/1999/xhtml"><head></head><body> (no line break)
// <img src="foo" /></body></html>

If you need to use this in a Node.js backend, you have to use xmldom (npm i xmldom).

edited Jan 19 '20 at 09:30

Supun Induwara

answered Aug 23 '12 at 13:38

Rob W

Hey Rob. I'd like to learn how to use this technique. I am running Opera9, IE6 and FF2 (all my code must run on older browsers for backward compatibility) and the above code does not work as-is. What else needs to be included to get this to work? Could you post a complete working function? Thanks. – ridgerunner Aug 23 '12 at 15:03
@ridgerunner It's supported by IE9+, FF 12+ (4+ with DOMParser text/html polyfill), Opera 11.6+ (w/ DOMParser polyfill). Chrome (21) has a bug where `/>` is not added. Sorry that I didn't elaborate much, I'm quite busy atm. Feel free to edit my/your answer to make it more complete if you wish. – Rob W Aug 23 '12 at 15:22
Still doesn't work in Chrome (34), so it's a cool technique but not if your users are using Chrome. – rossdavidh May 02 '14 at 21:18

score 4 · Answer 2 · edited Dec 08 '22 at 12:32

You can create an XHTML document and import or adopt HTML elements into it. HTML strings can be parsed through the innerHTML property, of course. The key point is to use Document.importNode() or Document.adoptNode() to convert HTML nodes to XHTML nodes:

var di = document.implementation;
var hd = di.createHTMLDocument();
var xd = di.createDocument('http://www.w3.org/1999/xhtml', 'html', null);
hd.body.innerHTML = '<img>';
var img = hd.body.firstElementChild;
var xb = xd.createElement('body');
xd.documentElement.appendChild(xb);
console.log('html doc:\n' + hd.documentElement.outerHTML + '\n');
console.log('xhtml doc:\n' + xd.documentElement.outerHTML + '\n');
img = xd.importNode(img); //or xd.adoptNode(img). Now img is an xhtml element
xb.appendChild(img);
console.log('xhtml doc after import/adopt img from html:\n' + xd.documentElement.outerHTML + '\n');

The output should be:

html doc:
<html><head></head><body><img></body></html>

xhtml doc:
<html xmlns="http://www.w3.org/1999/xhtml"><body></body></html>

xhtml doc after import/adopt img from html:
<html xmlns="http://www.w3.org/1999/xhtml"><body><img /></body></html>

Rob W's answer does not work in Chrome (at least version 29 and below) because DOMParser does not support the 'text/html' type there, and XMLSerializer generates HTML syntax (not XHTML) for an HTML document in Chrome.

This seems to be a better solution than Rob W's. importNode() takes a second parameter (deep), which you can set to true if you also want to convert descendant elements — fishbone, Jan 15 '15 at 12:19

score 2 · Answer 3 · edited May 23 '17 at 12:09

In addition to Rob W's answer, you can extract the body content using RegEx:

var doc = new DOMParser().parseFromString('<img src="foo">', 'text/html');
var result = new XMLSerializer().serializeToString(doc);

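// Capture everything between <body ...> and </body>; [\s\S] also matches line breaks.
var match = /<body[^>]*>([\s\S]*)<\/body>/i.exec(result);
if (match) result = match[1];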

// result:
// <img src="foo" />

Note: parseFromString(htmlString, 'text/html'); would throw an error in IE9 because the text/html MIME type is not supported there. It works in IE10 and IE11, though.
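
Because 'text/html' support differs between browsers, it may be worth probing for it before relying on it. The following is only a sketch: `supportsHtmlParsing` is a hypothetical helper name, and the parser constructor is passed in as an argument so the check can be exercised outside a browser too. It assumes that an unsupported type either throws (as described for IE9) or produces a falsy result.

```javascript
// Hypothetical helper: returns true when the given DOMParser-like
// constructor appears able to parse the 'text/html' type.
function supportsHtmlParsing(DOMParserCtor) {
  if (!DOMParserCtor) {
    return false; // no parser available at all
  }
  try {
    // Assumption: unsupported types either throw or yield a falsy value.
    return !!new DOMParserCtor().parseFromString('', 'text/html');
  } catch (e) {
    return false;
  }
}
```

In a browser this could be called as supportsHtmlParsing(window.DOMParser), falling back to a polyfill or another approach when it returns false.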

edited May 23 '17 at 12:09

Community

answered Nov 18 '13 at 20:33

Annie

Why use a regexp? You can simply use doc.body.innerHTML – Krunoslav Djakovic Oct 01 '16 at 12:17
Correcting myself: innerHTML will, e.g., return the tag without the closing slash instead of the self-closed form. But this regexp pattern works better: http://stackoverflow.com/questions/3628374/how-to-extract-body-contents-using-regexp – Krunoslav Djakovic Oct 01 '16 at 16:30

score 1 · Answer 4 · edited Aug 23 '12 at 14:39

This will do a pretty good job:

result = text.replace(/(<img\b[^<>]*[^<>\/])>/ig, "$1 />");

Addendum: In the (unlikely) event that your code contains tag attributes containing angle brackets (which is not valid XML/XHTML, BTW), then this one will do a slightly better job:

result = text.replace(/(<img\b(?:[^<>"'\/]+|'[^']*'|"[^"]*")*)>/ig, "$1 />");
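
The same idea can be stretched to the other HTML void elements, with the same caveats about regex-based rewriting. The sketch below is an illustrative extension, not part of the answer above: `closeVoidTags` and the tag list are assumptions, and the inner `+` of the second pattern is dropped so that characters are consumed one at a time, which should keep backtracking in check on tags that do not match.

```javascript
// Assumed list of void elements; extend or trim as needed.
// Quoted attribute values are skipped as a unit, so a '>' or '/'
// inside quotes does not end the tag early.
var VOID_TAG_RE = /(<(?:area|base|br|col|embed|hr|img|input|link|meta|source|track|wbr)\b(?:[^<>"'\/]|'[^']*'|"[^"]*")*)>/ig;

// Hypothetical helper: appends " />" to void tags that are not
// already self-closed. Tags that contain a bare slash are left alone.
function closeVoidTags(text) {
  return text.replace(VOID_TAG_RE, '$1 />');
}

console.log(closeVoidTags('<p>a<br>b</p><img src="foo">'));
// prints: <p>a<br />b</p><img src="foo" />
```

This rewrites any matching text, including text inside comments or script blocks, and it leaves a tag untouched when an unquoted attribute value contains a slash, so treat the result as a best-effort cleanup and not as a guarantee of well-formed output.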

edited Aug 23 '12 at 14:39

answered Aug 23 '12 at 13:55

ridgerunner

@John The reason that regular expressions must **not** be used for creating XHTML-conforming documents is that they are not reliable. This answer fails already at ``, for example. The output is ``. – Rob W Aug 23 '12 at 14:04
