Since the DOM has already been broken, you need to take a step back and try to salvage the HTML.
1) Find the parents of the broken elements. While a search-and-replace on document.body.innerHTML would probably work, you shouldn't really let regexes anywhere near large chunks of HTML. Performance is a concern as well, albeit a lesser one.
<img alt="img" src="<a href="http://...
will get parsed by the browser as an image with the source "<a href=".
With jQuery, you can simply ask $('img[src^="<a href"]') to get the images. Except in IE<8, you can use querySelectorAll with the same selector. If you don't have jQuery and want to support IE7, you need to use getElementsByTagName with manual filtering.
If you are really lucky, you can find the parent via getElementById (or the jQuery equivalent).
This is the easy part.
2) Your HTML doesn't validate, and the browser has already made some effort to fix it. You need to reverse that process. Predicting what the browser does is problematic, but let's attempt it.
Let's see what the browser does with
<img src="<a href="http://www.test.com/img/image-20x20.png">http://www.test.com/img/image-20x20.png</a>" style="margin:5px" />
This is how Chrome and Firefox fix it:
<img src="&lt;a href=" http:="" www.test.com="" img="" image-20x20.png"="">http://www.test.com/img/image-20x20.png" style="margin:5px" /&gt;
IE9 sorts the attributes within img alphabetically in innerHTML (o_0) and doesn't HTML-escape the < within src. IE7-8 additionally strip ="" from the attributes.
The image attributes will be hard to salvage, but the text content is unharmed. Anyway, the pattern can be seen: everything from <img up to src= should be preserved. Unfortunately, IE rearranges the attributes, so you have to capture the incorrect attributes as well. The src="..." attribute itself must be removed. Everything past that is incorrect in modern browsers, but in IE, proper attributes could have crept in there (and vice versa). Then the image tag ends.
Everything that follows is the real URL, up to the double quote. From the double quote up to the HTML-escaped /> (that is, /&gt;) are attributes that belong to the image tag. Let's hope they don't contain HTML. CSS is fine (for our purposes).
3) Let's build the regex: an opening IMG tag, any attributes (let's hope they don't contain HTML) (captured), the src attribute and its specific value (escaped or unescaped), any other attributes (captured), the end of tag, the URL (captured), some more attributes (captured) and the HTML-escaped closing tag.
/<img([^>]*?)src="(?:<|&lt;)a href="([^>]*?)>([^"]+?)"(.*?)\/&gt;/gi
You might be interested in how it's seen by RegexPal.com.
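To see what each capture group picks up, the pattern can be run against a single serialized fragment. This is only a sketch: the sample string is an assumption modelled on the Chrome/Firefox result, with the entities written the way innerHTML would report them, and the names sample and groups are made up for the illustration.

```javascript
// Sketch: apply the pattern to one serialized fragment and label the groups.
// `sample` is an assumed innerHTML string, not output captured from a real browser.
var re = /<img([^>]*?)src="(?:<|&lt;)a href="([^>]*?)>([^"]+?)"(.*?)\/&gt;/gi;

var sample =
  '<img src="&lt;a href=" http:="" www.test.com="" img="" image-20x20.png"="">' +
  'http://www.test.com/img/image-20x20.png" style="margin:5px" /&gt;';

var m = re.exec(sample);

var groups = {
  attr1: m[1], // whatever sits between <img and src=
  attr2: m[2], // debris the parser turned into attributes
  url:   m[3], // the real URL, recovered from the text after the tag
  attr3: m[4]  // attributes that ended up as plain text after the URL
};

console.log(groups.url);   // → http://www.test.com/img/image-20x20.png
console.log(groups.attr3); // the style attribute, with surrounding spaces
```

Only the url and attr3 groups carry useful data for this input; attr2 is pure debris, which is why the attributes need filtering before they are put back.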
What it should be replaced by: the image with the proper attributes concatenated, and with the src salvaged. It might be worthwhile to filter the attributes, so let's opt for a callback-replace. Normal attributes contain only word-characters in their keys. More importantly, normal attributes are usually non-empty strings (IMG tags don't have boolean attributes, unless you are using server-side image maps). This will match all empty attributes but not valid attribute keys: /\S+(?:="")?(?!=)/
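One caveat with that pattern: \S+ also matches a complete non-empty attribute such as style="margin:5px", because that token contains no whitespace, so real attributes get stripped along with the debris. If you want to keep them, a whitelist is one alternative. The helper below is a hypothetical sketch, not part of the original code: it assumes double-quoted values and keeps only pairs whose name consists of word characters or hyphens and whose value is non-empty.

```javascript
// Hypothetical helper (an assumption, not from the answer): keep only
// name="value" pairs with a plain name and a non-empty double-quoted value.
function keepValidAttributes(attrs) {
  // A pair must be preceded by whitespace or the start of the string and
  // followed by whitespace or the end, so debris like image-20x20.png"="" fails.
  var pair = /(?:^|\s)([\w-]+)="([^"]+)"(?=\s|$)/g;
  var kept = [];
  var m;
  while ((m = pair.exec(attrs)) !== null) {
    kept.push(m[1] + '="' + m[2] + '"');
  }
  return kept.join(" ");
}

console.log(keepValidAttributes(' style="margin:5px" '));
// → style="margin:5px"
console.log(keepValidAttributes(' http:="" www.test.com="" img="" image-20x20.png"=""'));
// → (empty string)
```

Unquoted values, which old IE versions may emit in innerHTML, would be dropped by this sketch, so it would need loosening if those browsers matter.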
Here is the code:
//forEach, indexOf, map need shimming in IE<9
//querySelectorAll cannot be reliably shimmed, so I'm not using that.
//author: Jan Dvorak
// https://stackoverflow.com/a/14157761/499214
var images = document.getElementsByTagName("img");
var parents = [];
//collect every parent that holds at least one broken image, once each
[].forEach.call(images, function(i){
    if(
        /(?:<|&lt;)a href=/.test(i.getAttribute("src"))
        && !~parents.indexOf(i.parentNode)
    ){
        parents.push(i.parentNode);
    }
});
//innerHTML serializes the stray text "/>" as "/&gt;", hence the entity at the end
var re = /<img([^>]*?)src="(?:<|&lt;)a href="([^>]*?)>([^"]+?)"(.*?)\/&gt;/gi;
parents.forEach(function(p){
    p.innerHTML = p.innerHTML.replace(
        re,
        function(match, attr1, attr2, url, attr3){
            //strip the debris from each captured attribute span
            var attrs = [attr1, attr2, attr3].map(function(a){
                return a.replace(/\S+(?:="")?(?!=)/g,"");
            }).join(" ");
            //rebuild the image around the salvaged URL
            return '<img '+attrs+' src="'+url+'" />';
        }
    );
});
fiddle: http://jsfiddle.net/G2yj3/1/