18

I load HTML from another page to extract and display data from it:

$.get('http://example.org/205.html', function (html) {
    console.log( $(html).find('#c1034') );
});

That works, but because of the $(html) call my browser tries to load the images that are referenced in 205.html. Those images do not exist on my domain, so I get a lot of 404 errors.

Is there a way to parse the page like $(html) but without loading the whole page into my browser?

Gras Double
PiTheNumber

7 Answers

17

Use regex and remove all <img> tags

 html = html.replace(/<img[^>]*>/g,"");
Bhuvan Rikka
  • That worked for me. Note that it would not work for style background images. For those you would need an [XML parser](http://stackoverflow.com/questions/11006216/load-an-html-string-into-jquery-without-requesting-images?rq=1), I guess. Thanks! – PiTheNumber Feb 27 '13 at 14:05
  • @PiTheNumber & Bhuvan: FWIW, that regex is trivial to bypass: http://jsbin.com/wejosoku/1 I'd like to think it would work with repeated application, but I wouldn't want to bet my site on no one being able to come up with a way around it. Regex is fundamentally unsuited to significant HTML parsing. – T.J. Crowder May 20 '14 at 06:59
  • @T.J.Crowder I know it's not safe, but in my case I can trust the other domain's HTML code. Regex is bad for almost everything and I strongly advise avoiding it wherever possible. I would be happy to see another solution, but a full HTML parser would be too big for this. – PiTheNumber May 20 '14 at 11:33
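
To illustrate the concern raised in these comments, here is a minimal sketch of one input that survives a single pass of the regex. The nested-tag payload and the helper names stripImgOnce and stripImgRepeated are assumptions made for illustration; this is not necessarily the bypass shown in the linked jsbin.

```javascript
// Single pass, using the regex from the answer above.
function stripImgOnce(html) {
    return html.replace(/<img[^>]*>/g, "");
}

// Hypothetical hardening: repeat until the string stops changing.
function stripImgRepeated(html) {
    var previous;
    do {
        previous = html;
        html = stripImgOnce(html);
    } while (html !== previous);
    return html;
}

// Hypothetical payload: removing the inner "<img>" re-assembles an outer <img> tag.
var nested = '<im<img>g src="http://example.org/missing.png">';

console.log(stripImgOnce(nested));
// <img src="http://example.org/missing.png">
console.log(JSON.stringify(stripImgRepeated(nested)));
// ""
```

Repeated application handles this particular input, but that says nothing about other malformed markup, so it is only reasonable when the source HTML is trusted.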
17

Actually, the jQuery documentation says that you can pass the "owner document" as the second argument to $.

So what we can then do is create a virtual document so that the browser does not automatically load the images present in the supplied HTML:

var ownerDocument = document.implementation.createHTMLDocument('virtual');
$(html, ownerDocument).find('.some-selector');
PiTheNumber
Thomas Brus
  • I have not tested this, but it looks to me like the best solution for this problem. If it does not work let me know. You could still use the string replace below but I always thought it is a [bad solution](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – PiTheNumber May 07 '18 at 07:29
  • Thanks, this is what I needed – Wim Pruiksma Jan 13 '20 at 11:11
4

Sorry for resuscitating an old question, but this is the first result when searching for how to try to stop parsed html from loading external assets.

I started from Nik Ahmad Zainalddin's answer, but it has a weakness: anything that sits between two <script> elements gets wiped out as well.

<script>
</script>
Inert text
<script>
</script>

In the above example, "Inert text" would be removed along with the script tags. I ended up doing the following instead:

html = html.replace(/<\s*(script|iframe)[^>]*>(?:[^<]*<)*?\/\1>/g, "").replace(/(<(\b(img|style|head|link)\b)(([^>]*\/>)|([^\7]*(<\/\2[^>]*>)))|(<\bimg\b)[^>]*>|(\b(background|style)\b=\s*"[^"]*"))/g, "");

Additionally I added the capability to remove iframes.
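
The difference between the two patterns can be checked on a small string. This is only a sketch: the function names stripGreedy and stripPerElement and the sample input are made up for illustration, while the regexes are copied unchanged from the two answers.

```javascript
// Pattern from Nik's answer: the greedy [^\7]* runs up to the LAST closing tag,
// so everything between the first <script> and the last </script> is removed.
function stripGreedy(html) {
    return html.replace(/(<(\b(img|style|script|head|link)\b)(([^>]*\/>)|([^\7]*(<\/\2[^>]*>)))|(<\bimg\b)[^>]*>|(\b(background|style)\b=\s*"[^"]*"))/g, "");
}

// Pattern from this answer: scripts and iframes are removed one element at a time
// by a lazy match, then the remaining tags and attributes are stripped.
function stripPerElement(html) {
    return html
        .replace(/<\s*(script|iframe)[^>]*>(?:[^<]*<)*?\/\1>/g, "")
        .replace(/(<(\b(img|style|head|link)\b)(([^>]*\/>)|([^\7]*(<\/\2[^>]*>)))|(<\bimg\b)[^>]*>|(\b(background|style)\b=\s*"[^"]*"))/g, "");
}

var sample = '<script>a()</script>Inert text<script>b()</script>';

console.log(JSON.stringify(stripGreedy(sample)));     // ""
console.log(JSON.stringify(stripPerElement(sample))); // "Inert text"
```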

Hope this helps someone.

Barak Gall
3

Parsing HTML the following way loads the images automatically:

var wrapper = document.createElement('div'),
    html = '.....';
wrapper.innerHTML = html;

If you use DOMParser to parse the HTML instead, the images are not loaded automatically. See https://github.com/panzi/jQuery-Parse-HTML/blob/master/jquery.parsehtml.js for details.
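
A minimal sketch of the DOMParser approach without the linked plugin might look like this. It assumes a browser environment, and the function name findWithoutLoading is hypothetical. The parsed document is detached from the page, so the images should not be requested while the nodes stay in it; moving those nodes into the live page may still trigger requests.

```javascript
// Hypothetical helper: parse an HTML string into a detached document and query it.
function findWithoutLoading(html, selector) {
    if (typeof DOMParser === 'undefined') {
        // Assumption: this sketch targets browsers; other runtimes may lack DOMParser.
        throw new Error('DOMParser is not available in this environment');
    }
    // 'text/html' selects the HTML parser; the resulting document is never displayed.
    var doc = new DOMParser().parseFromString(html, 'text/html');
    // Copy the NodeList into a plain array for convenience.
    return Array.prototype.slice.call(doc.querySelectorAll(selector));
}

// Example usage in a page that has jQuery loaded (URL and id taken from the question):
// $.get('http://example.org/205.html', function (html) {
//     console.log(findWithoutLoading(html, '#c1034'));
// });
```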

  • 1
    Perhaps I've missed something - but this example WILL cause the images to be reloaded. See https://jsfiddle.net/Abeeee/deg3846s/4/ for an example - if you have Devtools showing the Network trace then you'll see "richard" being loaded twice. https://stackoverflow.com/a/50194774/1432181 seems to have the working solution – user1432181 Oct 20 '21 at 16:59
1

You could either use jQuery's remove() method to remove the image elements

console.log( $(html).find('img').remove().end().find('#c1034') );

or remove them from the HTML string. Something like

console.log( $(html.replace(/<img[^>]*>/g,"")) );

Regarding background images, you could do something like this:

$(html).filter(function() {
    return $(this).css('background-image') !== ''; 
}).remove();
Johan
1

The following regex removes all occurrences of <img>, <head>, <link>, <script> and <style> elements, as well as background and style attributes, from the HTML string returned by the Ajax call.

html = html.replace(/(<(\b(img|style|script|head|link)\b)(([^>]*\/>)|([^\7]*(<\/\2[^>]*>)))|(<\bimg\b)[^>]*>|(\b(background|style)\b=\s*"[^"]*"))/g,"");

Test regex: https://regex101.com/r/nB1oP5/1

I wish there were a better way to work around this (other than using a regex replace).

Nik
0

Instead of removing all img elements altogether, you can use the following regex to delete all src attributes instead:

html = html.replace(/src="[^"]*"/ig, "");
Revadike
  • That would break the HTML because the src attribute is mandatory for the `<img>` element. See https://developer.mozilla.org/de/docs/Web/HTML/Element/img – PiTheNumber May 08 '17 at 10:19
  • That may be true, but it's a good alternative solution for anyone that uses img tag in their css selector or need data from one of the image attributes. – Revadike May 08 '17 at 14:26
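
If the image URL itself is still needed, one variation on this idea is to rename the attribute instead of deleting it, so the value stays readable but there is no src left to fetch. This is a sketch under assumptions: the function name neutralizeSrc and the data-src attribute name are choices made here, not part of the answer. Like the other regex approaches, it can be fooled by unusual markup, and it also renames src on non-image elements such as <script> and <iframe>.

```javascript
// Hypothetical helper: rename every src attribute to data-src.
function neutralizeSrc(html) {
    // The leading \s keeps existing attributes such as data-src untouched;
    // requiring "=" right after "src" leaves srcset alone (so srcset can still load images).
    return html.replace(/(\s)src(\s*=)/gi, "$1data-src$2");
}

var input = '<img src="http://example.org/a.png" alt="A">';
console.log(neutralizeSrc(input));
// <img data-src="http://example.org/a.png" alt="A">
```

The original URL can then be read back with, for example, $(img).attr('data-src').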