0

I need to add slashes to the end of all the image tags in a string. I'm using JavaScript regular expressions. Here is what I have so far:

strInput = strInput.replace(/<img.*">/gm, "");

But I'm not sure what to replace it with? I'm taking the value of a text area and parsing it as XML, but the image tags generate errors because they're HTML. Thanks.

user1177071
  • Also be sure not to use innerHTML to insert the XHTML, because that'll just convert everything back to HTML. You need to use document.createElement to create XHTML. – hobberwickey Jan 30 '12 at 01:22
  • @hobberwickey: your statement "You need to use document.createElement to create xhtml" is false. See my answer. – mathheadinclouds Jun 01 '20 at 22:14
  • @hobberwickey: you simply take any dom node of a document with mime-type HTML, then you add that node to any document of mime-type XHTML (which you can create programmatically), using `.appendChild`, and then you do `.outerHTML` on the node you just appended. Then you have your XHTML. You do not need to do any "recursive tree traversal" stuff. – mathheadinclouds Jun 02 '20 at 13:27
  • I suggest removing the regex tag. Using a regex to do what is asked here is a bad idea, and as far as I can see that is not controversial. So let's remove the tag. – mathheadinclouds Jun 03 '20 at 15:31

2 Answers

2

You should let the browser do the heavy lifting. The browser can obviously parse HTML; after all, how else would it show us web pages? You can use JavaScript to make the browser parse HTML for you by setting .innerHTML of some DOM node to your HTML string, or by using .insertAdjacentHTML. That turns your HTML string into a tree of DOM nodes, i.e., you have it parsed.

There are also browser built-in ways to turn your DOM tree into an XHTML string. You simply create an XHTML document programmatically, then add any DOM tree to it with .appendChild (the tree can come from an HTML, non-XHTML, document; that is perfectly fine). The .outerHTML and .innerHTML properties of your DOM tree, which now has an XHTML document as its owner document, will then give XHTML.

If you're starting with a DOM node, you can use the following 2 functions:

var nsx = "http://www.w3.org/1999/xhtml";
// Serialize the node itself (including its own tag) as XHTML.
function outerXHTML(node){
    // Create an empty XHTML document and move the node into it.
    var xdoc = document.implementation.createDocument(nsx, 'html');
    xdoc.documentElement.appendChild(node);
    return node.outerHTML;
}
// Serialize only the node's children as XHTML.
function innerXHTML(node){
    var xdoc = document.implementation.createDocument(nsx, 'html');
    xdoc.documentElement.appendChild(node);
    return node.innerHTML;
}

(Note that the node will be owned by the newly created XHTML document, so it will vanish from your original document. If it should remain there, clone it before calling one of the above functions.)
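
As a sketch of the cloning advice above, the variant below leaves the original node where it is by serializing a deep copy instead. The name outerXHTMLKeepingOriginal is hypothetical and not part of the answer; apart from the cloneNode call it follows the same approach as outerXHTML, and it assumes a browser environment.

```javascript
var nsx = "http://www.w3.org/1999/xhtml";

// Hypothetical helper (not from the original answer): like outerXHTML,
// but the node passed in stays in its original document.
function outerXHTMLKeepingOriginal(node) {
    var xdoc = document.implementation.createDocument(nsx, 'html');
    // cloneNode(true) makes a deep copy, so only the copy is moved.
    var copy = node.cloneNode(true);
    xdoc.documentElement.appendChild(copy);
    return copy.outerHTML;
}
```

Usage would look like outerXHTMLKeepingOriginal(document.getElementById('someId')), where 'someId' stands for whatever element you want to serialize.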

And if you're starting with a string, you just have to set innerHTML of a newly created node before calling the above. For your convenience, here is a snippet with three examples: two for HTML to XHTML, and one for XHTML to HTML.

function html2xhtml(html){
    var nsx = "http://www.w3.org/1999/xhtml";
    // Let the browser parse the HTML string into DOM nodes.
    var body = document.createElement('body');
    body.innerHTML = html;
    // Move the parsed tree into an XHTML document so it serializes as XHTML.
    var xdoc = document.implementation.createDocument(nsx, 'html');
    xdoc.documentElement.appendChild(body);
    return body.innerHTML;
}
function xhtml2html(xhtml){
    var body = document.createElement('body');
    body.innerHTML = xhtml;
    // Move the parsed tree into an HTML document so it serializes as HTML.
    var doc = document.implementation.createHTMLDocument();
    doc.documentElement.appendChild(body);
    return body.innerHTML;
}
var html1 = '<div>lorem<img>ipsum<img>dolor sit amet<br></div>';
var html2 = '<ul><li><svg><rect width="100" height="100"></rect></svg></li></ul>';
var html3x = '<img />';
var node1  = document.getElementById('node1');
var node1x = document.getElementById('node1x');
var node2  = document.getElementById('node2');
var node2x = document.getElementById('node2x');
var node3  = document.getElementById('node3');
var node3x = document.getElementById('node3x');
node1.textContent = html1;
node2.textContent = html2;
node3x.textContent = html3x;

node1x.textContent = html2xhtml(html1);
node2x.textContent = html2xhtml(html2);
node3.textContent = xhtml2html(html3x);

html<br><pre id='node1'></pre>xhtml<br><pre id='node1x'></pre><hr>
html<br><pre id='node2'></pre>xhtml<br><pre id='node2x'></pre><hr><hr>
xhtml<br><pre id='node3x'></pre>html<br><pre id='node3'></pre>


You can also do it with XMLSerializer (for the serializing part, not the parsing part); credit to @Kaiido.
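
A minimal sketch of the XMLSerializer route, assuming a browser environment. The function name html2xhtmlViaSerializer is hypothetical. DOMParser handles the parsing step here and XMLSerializer the serializing step. Be aware that the serializer typically adds an xmlns attribute to each top-level element, which you may or may not want.

```javascript
// Hypothetical helper (not from the original answer).
function html2xhtmlViaSerializer(html) {
    // Parse the string with the browser's HTML parser.
    var doc = new DOMParser().parseFromString(html, 'text/html');
    var serializer = new XMLSerializer();
    var nodes = doc.body.childNodes;
    var out = '';
    // Serialize each top-level node as XML and concatenate the results.
    for (var i = 0; i < nodes.length; i++) {
        out += serializer.serializeToString(nodes[i]);
    }
    return out;
}
```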

mathheadinclouds
0

You'll have to use a capture group:

strInput = strInput.replace(/(<img[^>]+)>/gm, "$1 />");

Here's the fiddle: http://jsfiddle.net/ChNnU/
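
One caveat with the pattern above: a tag that is already self-closed, such as <img src="a.png" />, gets a second slash, and a bare <img> with no attributes is not matched at all. A possible variant, sketched here under the assumption that attribute values never contain a literal > character, treats the trailing slash as optional so the replacement is safe to run more than once. The function name selfCloseImgTags is made up for illustration.

```javascript
// Hypothetical helper (not from the original answer).
function selfCloseImgTags(str) {
    // \b keeps other tag names that merely start with "img" from matching.
    // ([^>]*?) lazily captures the attributes; \s*\/? swallows an existing
    // trailing slash so it is not doubled in the replacement.
    return str.replace(/<img\b([^>]*?)\s*\/?>/g, '<img$1 />');
}

var sample = '<div><img src="a/b.png"><img alt="x" /><img></div>';
var once = selfCloseImgTags(sample);
var twice = selfCloseImgTags(once); // same string as `once`
```

As with the original regex, this is still pattern matching rather than parsing, so it is only as reliable as the input is regular.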

Joseph Silber
  • @icktoofay - Which is exactly why [you shouldn't really use a regex to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454). – Joseph Silber Jan 29 '12 at 23:50
  • Then why are you suggesting it as an answer? – icktoofay Jan 29 '12 at 23:52
  • @icktoofay - Because the OP was asking for a regex. Honestly, this'll work in 99.99% of cases. If you need this to be 100% bulletproof then, again, a regex is probably not the right tool for the job. – Joseph Silber Jan 29 '12 at 23:53
  • Cool. Thanks! This works. icktoofay: The HTML is coming from a CMS, so any invalid HTML has already been filtered out. – user1177071 Jan 29 '12 at 23:57
  • @icktoofay - I specifically said *probably*, because [I know that it **is** possible](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491) - it's just not very practical. – Joseph Silber Jan 30 '12 at 00:00
  • You should mention that this is Perl, not JavaScript. BNF-like regexes aren't possible in JavaScript. – Saxoier Jan 30 '12 at 00:06
  • @Saxoier I'm not sure what you mean, this example works perfectly well in Javascript – hobberwickey Jun 02 '20 at 13:28