JavaScript repair bad html tag

Question

I'm working on a SharePoint website. I don't have access to the webparts' code. I can only change the master pages with JavaScript.

One of the webparts has a bug: it outputs <img> tags with a bad src value.

Example:

It should output

<img alt="img" src="http://www.apicture.png" style="margin:5px" /><br /><br />

but it outputs

<img alt="img" src="<a href="http://www.apicture.png">http://www.apicture.png</a>" style="margin:5px" /><br /><br />

I tried to match and replace, but assigning to innerHTML broke the other scripts.

How can I repair this with JavaScript?

Edit:

I have the code:

var markup = document.documentElement.innerHTML;
markup = markup.replace(/src=\".*?(http:\/\/[^\"]+)\"/g,'src=\"$1\"');
document.documentElement.innerHTML = markup;

but it broke my webpage.

If the `src` contains unescaped double quotes and is itself double-quoted, fixing it is going to be difficult... — John Dvorak, Jan 04 '13 at 10:55
Isn't it better to put the effort into correcting the faulty webpart than to add messy code that fixes the issue after the fact? It may even be cheaper to properly fix the webpart than to tweak the DOM with JS. — Steve B, Jan 04 '13 at 10:55
I agree with you, but my boss doesn't want me to change the webpart. — Sancho, Jan 04 '13 at 11:00
@Sancho you better convince your boss the damage is already done and will take 6-8 weeks to fix properly. — John Dvorak, Jan 04 '13 at 11:01
A document with such a result will NEVER validate... I tried a quick fix but couldn't manage one. You should really get angry and insist that this gets fixed where it started. — Roko C. Buljan, Jan 04 '13 at 11:03
@roXon it will not validate but you can try to predict what the browser does to try to fix it. — John Dvorak, Jan 04 '13 at 11:05
With a relative img URL the webpart works, but I need an absolute URL. — Sancho, Jan 04 '13 at 12:17

score 5 · Accepted Answer · edited May 23 '17 at 11:56

Since the DOM has already been broken, you need to take a step back and try to salvage the HTML.

1) Find the parents of the broken elements. While search&replace inside the document.body.innerHTML would probably work, you shouldn't really let regexes anywhere near large chunks of HTML. Performance is a concern as well, albeit a lesser one.

<img alt="img" src="<a href="http://... will get parsed by the browser as an image with the source "<a href=".

With jQuery, you can simply ask $('img[src="<a href="]') to get the images. Except in IE<8, you can use querySelectorAll with the same selector. If you don't have jQuery, and want to support IE7, you need to use getElementsByTagName with manual filtering.

If you are really lucky, you can find the parent via getElementById (or the equivalent jQuery).

This is the easy part.
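
The manual filtering can be done with plain loops, which avoids the need for forEach or indexOf shims in old IE. This is only a sketch: the helper name findBrokenImageParents is hypothetical, and it assumes the broken src always starts with the stray anchor tag.

```javascript
// Hypothetical helper, not part of the answer's code: collect the distinct
// parents of images whose src starts with a stray "<a href=".
// `images` is any array-like of elements, e.g. the result of
// document.getElementsByTagName("img").
function findBrokenImageParents(images) {
  var parents = [];
  for (var i = 0; i < images.length; i++) {
    var src = images[i].getAttribute("src") || "";
    // Accept both the raw "<" and the escaped "&lt;" form.
    if (!/^(?:<|&lt;)a href=/.test(src)) {
      continue;
    }
    var parent = images[i].parentNode;
    var seen = false;
    // Manual replacement for parents.indexOf(parent).
    for (var j = 0; j < parents.length; j++) {
      if (parents[j] === parent) {
        seen = true;
        break;
      }
    }
    if (!seen) {
      parents.push(parent);
    }
  }
  return parents;
}
```

In a page this would be called as findBrokenImageParents(document.getElementsByTagName("img")).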

2) Your HTML doesn't validate, and the browser has already made some effort to fix it. You need to reverse the process. Predicting what the browser does is problematic, but let's try.

Let's see what the browser does with

<img src="<a href="http://www.test.com/img/image-20x20.png">http://www.test.com/img/image-20x20.png</a>" style="margin:5px" />

This is how Chrome and Firefox fix it:

<img src="&lt;a href=" http:="" www.test.com="" img="" image-20x20.png"="">http://www.test.com/img/image-20x20.png" style="margin:5px" /&gt;

IE9 sorts the attributes within img alphabetically in innerHTML (o_0) and doesn't HTML-escape the < within src. IE7-8 additionally strip ="" from the attributes.

The image attributes will be hard to salvage, but the text content is unharmed. Anyways the pattern can be seen:

Everything from <img up to src= should be preserved. Unfortunately, IE rearranges the attributes, so this part can contain bogus attributes as well. src="..." itself must be removed. Everything past that is incorrect in modern browsers, but in IE, proper attributes could have crept in there (and vice versa). Then the image tag ends.

Everything that follows is the real URL, up until the double quote. From the double quote up until the HTML-escaped /> are attributes that belong to the image tag. Let's hope they don't contain HTML. CSS is fine (for our purposes).

3) Let's build the regex: an opening IMG tag, any attributes (let's hope they don't contain HTML) (captured), the src attribute and its specific value (escaped or unescaped), any other attributes (captured), the end of tag, the URL (captured), some more attributes (captured) and the HTML-escaped closing tag.

/<img([^>]*?)src="(?:<|\&lt\;)a href="([^>]*?)>([^"]+?)"(.*?)\/&gt;/gi

You might be interested in how it's seen by RegexPal.com.
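
To see what the four capture groups pick up, the regex can be run against the Chrome/Firefox serialization quoted earlier. This is just an illustration on a plain string; the variable names sample and m are made up for the example.

```javascript
// The markup as Chrome and Firefox report it through innerHTML.
var sample = '<img src="&lt;a href=" http:="" www.test.com="" img="" ' +
  'image-20x20.png"="">http://www.test.com/img/image-20x20.png" ' +
  'style="margin:5px" /&gt;';

var re = /<img([^>]*?)src="(?:<|\&lt\;)a href="([^>]*?)>([^"]+?)"(.*?)\/&gt;/gi;
var m = re.exec(sample);

// m[1]: attributes before src (here a single space)
// m[2]: the bogus attributes the browser made out of the URL
// m[3]: the real URL, taken from the text after the broken tag
// m[4]: the attributes that ended up as text (here the style attribute)
console.log(m[3]); // http://www.test.com/img/image-20x20.png
console.log(m[4]); //  style="margin:5px"
```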

What it should be replaced by: the image with the proper attributes concatenated, and with the src salvaged. It might be worthwhile to filter the attributes, so let's opt for a callback-replace. Normal attributes contain only word-characters in their keys. More importantly, normal attributes are usually non-empty strings (IMG tags don't have boolean attributes, unless you are using server-side maps). This matches valueless and empty attributes but leaves key="value" pairs alone, as long as the values contain no whitespace: /(?:^|\s)[^\s=]+(?:="")?(?=\s|$)/

Here is the code:

// forEach, indexOf and map need shimming in IE<9.
// querySelectorAll cannot be reliably shimmed, so it is not used here.

//author: Jan Dvorak
// https://stackoverflow.com/a/14157761/499214

// Collect the distinct parents of all images with a broken src.
var images = document.getElementsByTagName("img");
var parents = [];
[].forEach.call(images, function(img){
  if(
    /(?:<|\&lt\;)a href=/.test(img.getAttribute("src"))
    && parents.indexOf(img.parentNode) === -1
  ){
    parents.push(img.parentNode);
  }
});

// Rewrite each broken image inside the serialized HTML of its parent.
var re = /<img([^>]*?)src="(?:<|\&lt\;)a href="([^>]*?)>([^"]+?)"(.*?)\/&gt;/gi;
parents.forEach(function(p){
  p.innerHTML = p.innerHTML.replace(
    re,
    function(match, attr1, attr2, url, attr3){
      // Drop valueless and empty (key="") attributes, keep key="value".
      // Assumes attribute values contain no whitespace.
      var attrs = [attr1, attr2, attr3].map(function(a){
        return a.replace(/(?:^|\s)[^\s=]+(?:="")?(?=\s|$)/g, "");
      }).join(" ");
      return '<img '+attrs+' src="'+url+'" />';
    }
  );
});

fiddle: http://jsfiddle.net/G2yj3/1/
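
The attribute filtering step can be tried in isolation on plain strings before touching the page. A minimal sketch, assuming attribute values contain no whitespace; the helper name dropJunkAttributes and the exact pattern are illustrative choices.

```javascript
// Hypothetical helper: remove valueless (key) and empty (key="") attributes
// from a chunk of attribute text, keeping key="value" pairs.
// Assumption: attribute values never contain whitespace.
function dropJunkAttributes(attrs) {
  // A token counts as junk when it starts after whitespace (or at the start),
  // contains no "=" except an optional trailing ="", and is followed by
  // whitespace or the end of the string.
  return attrs.replace(/(?:^|\s)[^\s=]+(?:="")?(?=\s|$)/g, "");
}

// Bogus attributes as serialized by Chrome/Firefox are removed entirely:
var junk = dropJunkAttributes(' http:="" www.test.com="" img="" image-20x20.png"=""');

// A real attribute survives untouched:
var kept = dropJunkAttributes(' style="margin:5px" ');

console.log(JSON.stringify(junk)); // ""
console.log(JSON.stringify(kept)); // " style=\"margin:5px\" "
```

If the real attributes can contain spaces (for example alt="my image"), this approach is not enough and the values would need to be parsed properly.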

Impressive answer! I like how you even commented your code and added author information ;-) — Dennis G, Jan 04 '13 at 13:29
Your code is very impressive. I tried to use it but I get an error. IE and Chrome say "'forEach' has Null value or isn't an object". — Sancho, Jan 04 '13 at 13:39
You're crazy man. Such things deserve some nice fee (I mean in $!) for your effort. +1 — Roko C. Buljan, Jan 04 '13 at 13:53
@Sancho I couldn't reproduce your issue. However, I did find a few bugs while testing and fixed them: http://jsfiddle.net/G2yj3/1/ — John Dvorak, Jan 04 '13 at 13:58
@Sancho as I already said in the code comments, you need a shim if you want to use a `forEach` in IE8 — John Dvorak, Jan 04 '13 at 14:01
I'm testing on Google Chrome. On the fiddle it works, but on Chrome it does nothing ><. What's the difference between Chrome JS and fiddle JS? (What is a 'shim'?) — Sancho, Jan 04 '13 at 14:31
One thing that jsFiddle does that you might have forgotten is that it waits until the page is loaded. Do you wait for the page load? Adding the script at the end of body will suffice. If you have jQuery, you can use `$(function(){`. If neither is an option, look into `window.addEventListener("load", function(){` and the corresponding IE<9 shim. — John Dvorak, Jan 04 '13 at 14:38
@Sancho Shim = shiv = polyfill = implementation of some newer feature for older browsers. There are multiple shim libraries for IE6-8, or you can build one, since MDN lists the polyfill for each ECMAScript 5 method it documents. There's a decent chance there's one built into SharePoint. — John Dvorak, Jan 04 '13 at 14:39
OK, I found one. I have one problem :(. On Chrome it works, but on IE8 I get 'http://www.image.png%3C/A%3E', with an encoded </A> at the end of the URL. — Sancho, Jan 04 '13 at 14:55
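
The advice about waiting for the page load can be sketched as a small helper. This is an illustration under assumptions: the name runWhenLoaded is hypothetical, and the window object is passed in as a parameter only so the helper can be exercised outside a browser.

```javascript
// Hypothetical helper: run `fn` once the page has finished loading.
// `win` is normally the global `window`.
function runWhenLoaded(win, fn) {
  if (win.addEventListener) {
    // Standard browsers, including IE9 and later.
    win.addEventListener("load", fn, false);
  } else if (win.attachEvent) {
    // Older IE has attachEvent instead of addEventListener.
    win.attachEvent("onload", fn);
  }
}
```

In a page it would be used as runWhenLoaded(window, repairImages), where repairImages stands for whatever function wraps the repair code above.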

score 1 · Answer 2 · answered Jan 04 '13 at 11:11


You can repair src attribute with regex but it won't repair the entire page. The reason is that web browser is trying to parse such bad HTML and produces weird output (extra elements etc.) before JS is executed. Since you cannot interfere the HTML parsing/rendering engine, there's no reasonable way other than changing the original content to fix this.

answered Jan 04 '13 at 11:11 by oleq

@JanDvorak can't wait to see it – Roko C. Buljan Jan 04 '13 at 11:15
