While other answers focus on guessing whether your desire (parsing DOM without string manipulation) makes sense, I will dedicate this answer to a comparison of reasonable DOM parsing methods.
For a fair comparison, I assume that we need the <body> element (as root container) for the parsed DOM. I have created a benchmark at http://jsperf.com/domparser-vs-innerhtml-vs-createhtmldocument.
var testString = '<body>' + Array(100001).join('<div>x</div>') + '</body>';
function test_innerHTML() {
    var b = document.createElement('body');
    b.innerHTML = testString;
    return b;
}
function test_createHTMLDocument() {
    var d = document.implementation.createHTMLDocument('');
    d.body.innerHTML = testString;
    return d.body;
}
function test_DOMParser() {
    return (new DOMParser).parseFromString(testString, 'text/html').body;
}
The first method is your current one. It is well supported across all browsers.
Even though the second method has the overhead of creating a full document, it has a big benefit over the first one: resources (images) are not loaded. The overhead of the document is marginal compared to the potential network traffic of the first one.
The last method is, as of writing, only supported in Firefox 12+ (no problem, since you're writing a Greasemonkey script), and is the specific tool for this job (with the same advantages as the previous method). As its name implies, it is a DOM parser.
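If the script ever has to run where DOMParser does not handle 'text/html', the second and third methods can be combined. The following is only a minimal sketch of that idea, not something I benchmarked: `parseBody` and its `env` parameter are names I made up for illustration (`env` defaults to the global object and exists only so the function can be tried with stand-in objects).

```javascript
// Sketch: prefer DOMParser with 'text/html', fall back to createHTMLDocument.
// `parseBody` and `env` are hypothetical names, not part of any browser API.
function parseBody(html, env) {
    env = env || globalThis;
    if (typeof env.DOMParser === 'function') {
        try {
            var doc = new env.DOMParser().parseFromString(html, 'text/html');
            // Implementations without 'text/html' support may return null or throw.
            if (doc && doc.body) {
                return doc.body;
            }
        } catch (e) {
            // Unsupported MIME type: fall through to the fallback below.
        }
    }
    // Fallback: a detached HTML document, so resources (images) are not loaded.
    var d = env.document.implementation.createHTMLDocument('');
    d.body.innerHTML = html;
    return d.body;
}
```

In a userscript, `parseBody(responseText)` would then return the <body> element regardless of which of the two methods was available.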
The benchmark shows that the original method is the fastest (4.64 ops/s), followed by the DOMParser method (4.22 ops/s). The slowest is the createHTMLDocument method (3.72 ops/s). The differences are minimal though, so I definitely recommend the DOMParser for the reasons stated earlier.
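To get a rough feel for such numbers on your own machine without jsPerf, a crude timing loop is enough. This is a much simpler stand-in for what jsPerf does (no warm-up, no statistical sampling), and `opsPerSecond` and `minMs` are names I chose for this sketch.

```javascript
// Rough operations-per-second estimate: call fn repeatedly for at least
// minMs milliseconds (default 1000), then divide the run count by the time.
// `opsPerSecond` is a hypothetical helper, not part of jsPerf.
function opsPerSecond(fn, minMs) {
    minMs = minMs || 1000;
    var runs = 0;
    var start = Date.now();
    var elapsed = 0;
    do {
        fn();
        runs++;
        elapsed = Date.now() - start;
    } while (elapsed < minMs);
    return runs / (elapsed / 1000);
}
```

For example, `opsPerSecond(test_DOMParser)` and `opsPerSecond(test_innerHTML)` can be compared in the browser console; expect the absolute values to differ from the ones quoted above.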
I know that you're using GM_xmlhttpRequest to fetch data. However, if you're able to use XMLHttpRequest instead, I suggest giving the following method a try: instead of getting plain text as the response, you can get a document:
var xhr = new XMLHttpRequest();
xhr.open('GET', 'http://www.example.com/');
xhr.responseType = 'document';
xhr.onload = function() {
    var bodyElement = xhr.response.body; // xhr.response is a document object
};
xhr.send();
If the Greasemonkey script stays active on a single page for a long time, you can still use this feature for other domains which do not support CORS: insert an iframe in the document whose domain is equal to the other domain (e.g. http://example.com/favicon.ico), and use it as a proxy (activate the GM script for this page as well). The overhead of inserting an iframe is significant, so this option is not viable for one-time requests.
For same-origin requests, this option may be the best one (although I have not benchmarked it, one can argue that returning a document directly, instead of going through an intermediate string, offers performance benefits). The responseType = "document" feature is also supported by more browsers than the DOMParser + text/html method: Chrome 18+, Firefox 11+ and IE 10+.
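Since support depends on the browser version, it may be worth checking for it at runtime before relying on xhr.response. The sketch below assumes that an implementation which does not know the 'document' value ignores the assignment; `supportsDocumentResponse` and its `XHRCtor` parameter are hypothetical names (the parameter only exists so the check can be tried with a stand-in constructor).

```javascript
// Returns true if the XHR implementation accepts responseType = 'document'.
// `supportsDocumentResponse` and `XHRCtor` are names made up for this sketch.
function supportsDocumentResponse(XHRCtor) {
    XHRCtor = XHRCtor || (typeof XMLHttpRequest !== 'undefined' ? XMLHttpRequest : null);
    if (!XHRCtor) {
        return false;
    }
    try {
        var xhr = new XHRCtor();
        // Without a native responseType property the assignment below would
        // just create a plain property and give a false positive.
        if (!('responseType' in xhr)) {
            return false;
        }
        xhr.open('GET', '/', true); // open() alone does not send a request
        xhr.responseType = 'document';
        // Assumption: unsupported values are ignored, so the value does not stick.
        return xhr.responseType === 'document';
    } catch (e) {
        return false;
    }
}
```

If the check returns false, the response text can still be fetched as a string and parsed with one of the three methods from the start of this answer.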
, scans for plethora of elements and replaces textContent of each one with "[" and "]", locates all elements and removes them. I figured it didn't matter since I've become most comfortable with this method and they're not the bottleneck.
– User2121315 Oct 07 '12 at 20:50
element among millions of alike, the best approach is to get .textContent (after my replacements). If only I could discard the rest...
– User2121315 Oct 07 '12 at 20:57
` element, if the HTML is properly formatted, that means there can't be a nested `
` or block element. So you may want to get the index of the first `
`, then the index of the first `
` or the next ``. If there's no `
` or next ``, then you'd need to find the next opening tag for a block type element. Then grab the content in between the indices.
– I Hate Lazy Oct 07 '12 at 21:06