
I'm appending a whole HTML page to a div (to scrape it). How do I stop the browser from requesting the script and CSS files? I tried removing those nodes immediately, but they still get requested.
It's for a browser add-on; I'm scraping with JS.

NestedWeb
  • "Scrape" how exactly? Do you even need to insert it into the DOM? – adeneo Jan 06 '15 at 10:23
  • Is it possible to `querySelector` from a string? – NestedWeb Jan 06 '15 at 10:25
  • As you get a string back to append to your page, why not just use a JS regex to remove the script and CSS tags? – Pete Jan 06 '15 at 10:27
  • There's DOMParser; not sure whether it will load the resources or not. But how exactly are you getting the HTML of an entire page onto the client side? There has to be something going on on the server side here. – adeneo Jan 06 '15 at 10:29
  • Is the snippet proper XML? If so, you can use **[responseXML](http://www.w3schools.com/ajax/ajax_xmlhttprequest_response.asp)** to examine the response. Otherwise, look at **[DocumentFragment](https://developer.mozilla.org/en-US/docs/Web/API/DocumentFragment)** – asontu Jan 06 '15 at 10:37
  • I think you wanna look at **[this answer](http://stackoverflow.com/a/7539198/2684660)** about loading a page into a DocumentFragment. – asontu Jan 06 '15 at 10:45
  • @NestedWeb - you should add more details about who uses this. It may be as simple as turning off JavaScript in the browser. – avnr Jan 06 '15 at 10:58

2 Answers


As @adeneo wrote, you don't have to add the HTML to the page in order to scrape information from it; you can turn it into a DOM tree that is disconnected from the page DOM and process it there.

With jQuery this is as simple as `$("html text here")`; you can then scrape the result using the jQuery API, e.g.:

function scrape_html(html_string) {
    // Build a jQuery-wrapped DOM tree that is not attached to the page.
    var $dom = $(html_string);
    var name = $dom.find('.name').text();
    return name;
}

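Without jQuery:

function scrape_html(html_string) {
    // Parse into a detached element; it is never attached to the page DOM.
    var container = document.createElement('div');
    container.innerHTML = html_string;
    // textContent is more widely supported than innerText.
    var match = container.getElementsByClassName('name')[0];
    return match ? match.textContent : null;
}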
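
The DOMParser route mentioned in the comments can be sketched in a similar way. This is only a sketch, assuming a browser-like environment where `DOMParser` is available; the function name `scrapeName` is made up for illustration, and `.name` is the selector from the examples above. A document produced by `parseFromString` is separate from the page, so its scripts do not run and it should not trigger stylesheet or script requests. It also accepts a complete page, including the `<html>` and `<head>` elements, which a div cannot hold.

```javascript
// Sketch only: parse a full HTML page into a separate, inert document and query it.
// Assumes DOMParser exists (e.g. in a page or add-on script); it is not defined in plain Node.
function scrapeName(htmlString) {
    var doc = new DOMParser().parseFromString(htmlString, 'text/html');
    // '.name' is the illustrative selector used in the examples above.
    var node = doc.querySelector('.name');
    return node ? node.textContent : null;
}
```

Because the parsed document never becomes part of the page, there is no need to remove script or link nodes before querying it.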
Iftah
  • Unfortunately I am not using jQuery. I'll look more into it. – NestedWeb Jan 06 '15 at 10:39
  • Added a non-jQuery method that may work - it might be problematic adding `<html>` (which should be the top-level node) into a div, not sure. – Iftah Jan 06 '15 at 11:00
  • I already tried that, it doesn't work. I needed one big container div which doesn't contain any file links. I decided to split the html string with some ids. It's working fine. – NestedWeb Jan 06 '15 at 11:09

Setting the innerHTML of a temporary HTML element that has not been added to the document will not execute scripts, and because the element does not belong to your document, the styles will not be applied either.

This will give you an opportunity to strip out any unwanted elements before copying the innerHTML to your own document.

Example:

var temp = document.createElement('div');
temp.innerHTML = html; // the HTML of the 'other' page.

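// Remove every descendant of 'element' with the given tag name.
function removeElements(element, tagName)
{
    // getElementsByTagName returns a live collection, so it shrinks as nodes are removed.
    var elements = element.getElementsByTagName(tagName);

    while(elements.length > 0)
    {
        elements[0].parentNode.removeChild(elements[0]);
    }
}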

removeElements(temp, 'script');
removeElements(temp, 'style');
removeElements(temp, 'link');

container.innerHTML = temp.innerHTML; // 'container' is the target element in your own page.
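
A string-level variant of the same idea, along the lines of the regex suggestion in the comments, is to drop the tags before the markup is parsed at all. The sketch below is an assumption-laden illustration, not a robust HTML sanitizer: the name `stripResourceTags` is hypothetical, and regular expressions can be fooled by unusual markup (for example a `</script>` sequence inside a string literal or a `>` inside an attribute value).

```javascript
// Sketch only: remove <script>, <style> and <link> tags from an HTML string
// before it is handed to innerHTML. Not a general-purpose HTML parser.
function stripResourceTags(html) {
    return html
        // <script ...> ... </script>, including inline code spanning several lines
        .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '')
        // <style ...> ... </style>
        .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '')
        // <link ...> has no closing tag
        .replace(/<link\b[^>]*>/gi, '');
}

var cleaned = stripResourceTags(
    '<link rel="stylesheet" href="a.css"><script src="a.js"></script><p class="name">Bob</p>'
);
console.log(cleaned); // <p class="name">Bob</p>
```

For anything beyond well-behaved input, the DOM-based removal above is the safer choice, since it relies on the browser's own parser.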
Lee Kowalkowski