How to optimize string-to-DOM conversion?

Question

I'm faced with a slightly inconvenient lag when I attempt to populate a div created in JavaScript:

var el = document.createElement("div");
el.innerHTML = '<insert string-HTML code here>';

However, this is natural due to the extent of the HTML code; sometimes it's more than 300,000 characters long, and it is retrieved with GM_xmlhttpRequest, which sometimes takes 1000ms (give or take) to complete, plus the additional 500ms caused by the DOM-ification.

I have attempted to get rid of a massive amount of text using substr (granted, not the best idea that could've occurred to me), and it surprisingly worked for the most part, but at certain times the element would fail to accept the HTML code (probably unmatched <*.?>).

I only need to access an extremely small amount of text that's stored inside; regexp is, per bobince, out of the question, and I figured this would be the best approach.

EDIT: I should mention that I understated what I meant by parsing the DOM: this 'text' is the textContent of quite a few elements, which I modify. Therefore, regexp isn't an option.

"I only need to access an extremely small amount of text that's stored inside" if you don't tell us how you're going to "identify" the fragment of the text you need to extract, how are we supposed to propose a solution? — CAFxX, Oct 07 '12 at 20:32
Can you show the fragment that you're targeting? Regex isn't good for parsing an entire DOM, but that doesn't mean it can never be used in situations involving HTML markup. — I Hate Lazy, Oct 07 '12 at 20:33
`bigString.substr(0,0)` will make things pretty zippy, but maybe that's overly broad for your needs? That's not entirely clear. — ultranaut, Oct 07 '12 at 20:35
@CAFxX: With `getElementsBy{Class,Tag}Name()`, it catches the first `<…>`, scans for a plethora of elements and replaces the textContent of each one with "[" and "]", then locates all `<sup>` elements and removes them. I figured it didn't matter since I've become most comfortable with this method and they're not the bottleneck. — User2121315, Oct 07 '12 at 20:50
Wait, so you're saying that you actually need to use the entire DOM created, but you just need to do some replacements? Your question makes it sound like you're trying to *extract* a small part of it. — I Hate Lazy, Oct 07 '12 at 20:53
@user1689607: But it's only one `<…>` element among millions alike; the best approach is to get its .textContent (after my replacements). If only I could discard the rest... — User2121315, Oct 07 '12 at 20:57
@User2121315: Since it's a `<…>` element, if the HTML is properly formatted, that means there can't be a nested `<…>` or block element. So you may want to get the index of the first `<…>`, then the index of the first `</…>` or the next `<…>`. If there's no `</…>` or next `<…>`, then you'd need to find the next opening tag for a block type element. Then grab the content in between the indices. — I Hate Lazy, Oct 07 '12 at 21:06
@user1689607: That actually might be promising. I will try to do just that and hope I don't find a page that's the exception to the rule. — User2121315, Oct 07 '12 at 21:16
Yeah, that's the concern. You'll need *some* sort of guarantees to match the tags. And you'll need to take into consideration the capitalization of the tags and any potential attributes. Good luck! :-) — I Hate Lazy, Oct 07 '12 at 21:21

Rob W · Accepted Answer · 2012-10-07T21:26:47.920

While other answers focus on guessing whether your desire (parsing DOM without string manipulation) makes sense, I will dedicate this answer to the comparison of reasonable DOM parsing methods.

For a fair comparison, I assume that we need the <body> element (as root container) for the parsed DOM. I have created a benchmark at http://jsperf.com/domparser-vs-innerhtml-vs-createhtmldocument.

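// Build a large test document: 100,000 <div> elements inside a <body>.
var testString = '<body>' + Array(100001).join('<div>x</div>') + '</body>';

// Method 1: assign to innerHTML of a detached element in the current document.
function test_innerHTML() {
    var b = document.createElement('body');
    b.innerHTML = testString;
    return b;
}
// Method 2: assign to innerHTML inside a separate document.
function test_createHTMLDocument() {
    var d = document.implementation.createHTMLDocument('');
    d.body.innerHTML = testString;
    return d.body;
}
// Method 3: parse the string directly with DOMParser.
function test_DOMParser() {
    return new DOMParser().parseFromString(testString, 'text/html').body;
}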

The first method is your current one. It is well-supported across all browsers.
Even though the second method has the overhead of creating a full document, it has a big benefit over the first one: resources (images) are not loaded. The overhead of the document is marginal compared to the potential network traffic of the first one.

The last method is, as of writing, only supported in Firefox 12+ (no problem, since you're writing a Greasemonkey script), and is the specific tool for this job (with the same advantages as the previous method). As its name implies, it is a DOM parser.
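
As a rough sketch of how the DOMParser route might look for the asker's case (reading textContent from a handful of elements), the helper below parses the string and collects text by selector. The name extractText, its parameters, and the optional ParserCtor argument are assumptions made for illustration; they are not part of the benchmark.

```javascript
// Hypothetical helper: parse an HTML string off-document and return the
// textContent of every element matching `selector`.
function extractText(html, selector, ParserCtor) {
    // Default to the browser's DOMParser; a compatible constructor can be
    // passed in for environments that do not provide one.
    var Parser = ParserCtor || DOMParser;
    var doc = new Parser().parseFromString(html, 'text/html');
    var nodes = doc.querySelectorAll(selector);
    var texts = [];
    for (var i = 0; i < nodes.length; i++) {
        texts.push(nodes[i].textContent);
    }
    return texts;
}
```

In a userscript this could be called as extractText(response.responseText, 'p') inside the GM_xmlhttpRequest onload callback; the selector 'p' is only a placeholder for whatever elements are actually needed.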

The benchmark shows that the original method is the fastest (4.64 Ops/s), followed by the DOMParser method (4.22 Ops/s). The slowest method is the createHTMLDocument method (3.72 Ops/s). The differences are minimal though, so I definitely recommend the DOMParser for the reasons stated earlier.
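
Timings like these vary between machines and pages, so it may be worth measuring locally. The following is a minimal sketch of an operations-per-second timer; it is not how jsPerf measures, and opsPerSecond and minMs are hypothetical names.

```javascript
// Hypothetical helper: call `fn` repeatedly for at least `minMs` milliseconds
// and return how many calls completed per second.
function opsPerSecond(fn, minMs) {
    var runs = 0;
    var start = Date.now();
    var elapsed = 0;
    do {
        fn();
        runs++;
        elapsed = Date.now() - start;
    } while (elapsed < minMs);
    // Guard against a zero-millisecond measurement.
    return runs / (Math.max(elapsed, 1) / 1000);
}
```

With the three test functions above defined in a page, something like opsPerSecond(test_DOMParser, 2000) in the console would give a rough figure to compare against the other two.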

I know that you're using GM_xmlhttpRequest to fetch data. However, if you're able to use XMLHttpRequest instead, I suggest giving the following method a try: instead of getting plain text as the response, you can get a document:

var xhr = new XMLHttpRequest();
xhr.open('GET', 'http://www.example.com/');
xhr.responseType = 'document';
xhr.onload = function() {
    var bodyElement = xhr.response.body; // xhr.response is a document object
};
xhr.send();

If the Greasemonkey script stays active on a single page for a long time, you can still use this feature for other domains which do not support CORS: insert an iframe in the document whose domain is equal to the other domain (e.g. http://example.com/favicon.ico), and use it as a proxy (activate the GM script for this page as well). The overhead of inserting an iframe is significant, so this option is not viable for one-time requests.

For same-origin requests, this option may be the best one (although not benchmarked, one can argue that returning a document directly instead of intermediate string manipulation offers performance benefits). Unlike the DOMParser+text/html method, responseType="document" is supported by more browsers: Chrome 18+, Firefox 11+ and IE 10+.

Interesting, I show a 200ms difference between `DOMParser` and `createElement`. The first mentioned takes about +/-422ms and the other one +/-642ms. Ironically, the very tests hosted there suggest a 3.11 Ops/s on the very method I used originally. I assume it didn't account for images. — User2121315, Oct 07 '12 at 21:15
@User2121315 I have added another method to my answer. Might be interesting for you, depending on your situation. — Rob W, Oct 07 '12 at 21:28
I would like to thank you for introducing me to `DOMParser`, it is certainly better than the method I've been using. As for the `XMLHttpRequest`, I'm afraid I'm bound by CORS but the iframe solution sounds very hackish and perhaps outweighs the benefit of those 450ms. — User2121315, Oct 07 '12 at 21:39

score 0 · Answer 2 · answered Oct 07 '12 at 20:39

We'd need to know a bit more about your application, but when you're working with that much HTML content, you might just want to use an iframe. It's asynchronous, it won't stall JS code, and it won't introduce a plethora of potential debugging problems.

It can be dangerous to populate an element with raw HTML from an XMLHttpRequest, mainly due to potential XSS vulnerabilities and next-to-impossible-to-fix HTML glitches. If at all possible, consider using a template (I believe jQuery offers some sort of templating solution) and loading a small amount of XML/JSON/etc. Only do that if using an iframe is out of the question though.

At this point, other parts of my code use the DOM method to access data, and the code works asynchronously, in parallel with another main application. — User2121315, Oct 07 '12 at 21:02

jfriend00 · Answer 3 · 2012-10-07T20:54:13.260

If you have a giant amount of HTML, it's taking a long time to put into the DOM, and you only want a small piece of it, the ways to make that faster are:

1. Get your server to serve up only the parts of the HTML you actually want. This would save on both the network transfer time and the DOM parsing time.
2. If you can't modify the server, then you need to manually parse some of the HTML to eliminate the parts you don't want, so that less is put into the DOM. A regex is one of the slower ways to search a giant string, so it's better to use something like .indexOf() if possible to identify the general area you are targeting. If there is a unique id or class and you know the general form of the HTML, you can use a faster algorithm like that to identify the target area. But, without you disclosing the actual HTML to be parsed, we can't offer more specifics than that.
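
The .indexOf() idea could be sketched roughly as follows. This is only an illustration under assumed markup: the helper name sliceBetween, the sample page string, and the id "target" are all hypothetical, since the real HTML was never shown.

```javascript
// Hypothetical helper: return the part of `html` that starts at the first
// `startMarker` and ends just after the next `endMarker`.
// Returns null when either marker is missing.
function sliceBetween(html, startMarker, endMarker) {
    var start = html.indexOf(startMarker);
    if (start === -1) return null;
    var end = html.indexOf(endMarker, start + startMarker.length);
    if (end === -1) return null;
    return html.slice(start, end + endMarker.length);
}

// Made-up sample page; only the middle <div> is of interest.
var page = '<html><body><div>noise</div>' +
           '<div id="target"><p>wanted</p></div>' +
           '<div>more</div></body></html>';
var fragment = sliceBetween(page, '<div id="target">', '</div>');
// fragment is '<div id="target"><p>wanted</p></div>'
```

Only the small fragment would then be handed to innerHTML or a parser. The match is case-sensitive and stops at the first end marker, so it breaks on nested elements of the same type or on tags written with different capitalization or attributes.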

#1: Sadly, I'm not able to use this method. #2: I forgot to mention that, even though the text is small, it involves less-than-complex-but-more-than-simple manipulations of DOM. — User2121315, Oct 07 '12 at 21:04
@User2121315 - so without showing us the actual HTML, how do you expect us to offer any specifics about how to trim it quickly? — jfriend00, Oct 07 '12 at 21:16
