I'm attempting to write some JavaScript code (in particular, a Chrome extension) which does the following:

  1. Retrieve some web page's contents via AJAX.
  2. Get some content from that page by locating certain elements inside of the HTML string and getting their contents.
  3. Do a thing with that data.

I have 1) and 3) working, but I'm having some trouble achieving step 2) in a reasonable way.

I currently have 2) implemented via `jQuery(htmlString)`, followed by normal jQuery selectors to extract the data I want. The problem is that jQuery actually adds the retrieved HTML to the current page, loading and executing all external resources and scripts in the process. This is obviously bad.

So I'm looking for a way to get the text and HTML in certain tags inside my HTML string without:

  • Loading or executing ANY scripts or resources (images, CSS, etc.) referenced in the HTML string.
  • Trying to remove external resources with regular expressions, since we all know what happens when you parse [X]HTML with regex.

I believe that I can achieve what I want using jsdom and jQuery, since jsdom has a FetchExternalResources option which can be set to false. However, jsdom seems to work only in Node.js, not in the browser.

Is there any reasonable way to do this?

CmdrMoozy
  • Have you looked at jQuery.parseHTML()? See [https://api.jquery.com/jquery.parsehtml/](https://api.jquery.com/jquery.parsehtml/). It looks like it'll do exactly what you want. – Jesse Rosalia Jul 17 '14 at 21:09
  • `jQuery.parseHTML` still attempts to load external images and so on, and additionally its attempts at not executing scripts are trivially thwarted - from the documentation: "However, it is still possible in most environments to execute scripts indirectly, for example via the `<img onerror>` attribute." – CmdrMoozy Jul 17 '14 at 21:17
  • Oh yea you're right. Can you guarantee that the remote resource is XHTML? If so, maybe you can use parseXML to parse it. If not, I'm out of ideas. – Jesse Rosalia Jul 17 '14 at 21:22

1 Answer

You could use `document.implementation.createHTMLDocument`, which creates a separate HTML document that is not attached to the current page.

This is an experimental technology

Because this technology's specification has not stabilized, check the compatibility table for the proper prefixes to use in various browsers. Also note that the syntax and behavior of an experimental technology are subject to change in future versions of browsers as the spec changes.

Feature         Chrome  Firefox (Gecko) Internet Explorer   Opera   Safari
Basic support   (Yes)   4.0 (2.0) [1]   9.0                (Yes)    (Yes)

[1] The title parameter has only been optional since Firefox 23.

Javascript

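// Fetch the remote page as a string, then parse it in a separate document.
$.ajax("http://www.html5rocks.com/en/tutorials/").done(function (htmlString) {
    // This document is not attached to the current page.
    var doc = document.implementation.createHTMLDocument("");

    doc.write(htmlString);
    doc.close(); // signal that writing is finished

    console.log(doc.getElementById('siteheader').textContent);
});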

You can also take a look at DOMParser and XMLHttpRequest
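
Example using DOMParser

As a rough sketch of the DOMParser route: `parseFromString` with the `"text/html"` type returns a separate Document, and script elements in a document parsed this way are not executed. Whether other resources such as images are fetched is worth verifying in your target browser. The helper name `extractText` and its optional third parameter (which exists only so the parser can be swapped out) are made up for illustration and are not part of any library.

```javascript
// Hypothetical helper, not part of any library: parse an HTML string into a
// separate Document and return the text of the first element matching
// `selector`, or null if nothing matches.
// `Parser` defaults to the browser's DOMParser; it is a parameter only so
// that a different constructor can be passed in (an assumption of this sketch).
function extractText(htmlString, selector, Parser = globalThis.DOMParser) {
    var doc = new Parser().parseFromString(htmlString, "text/html");
    var el = doc.querySelector(selector);
    return el ? el.textContent : null;
}
```

In the AJAX callback above, this could be called as `extractText(htmlString, "#siteheader")`.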

Example using XMLHttpRequest

XMLHttpRequest originally supported only XML parsing. HTML parsing support is a recent addition.

Feature Chrome  Firefox (Gecko) Internet Explorer   Opera   Safari (WebKit)
Support 18      11              10                  ---     Not supported

Javascript

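var xhr = new XMLHttpRequest();
xhr.onload = function () {
    // With responseType "document", responseXML holds the parsed Document.
    console.log(this.responseXML.getElementById('siteheader').textContent);
};

xhr.open("GET", "http://www.html5rocks.com/en/tutorials/");
// Ask the browser to parse the response as a document instead of returning text.
xhr.responseType = "document";
xhr.send();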

Xotic750