Parsing XHTML string with Regex in Javascript and converting it to DOM

Question

Disclaimer: before the you-can't-parse-html-with-regex blind mantra begins - please give me the benefit of the doubt and read this question to the end (+ assume I already know about That RegEx-ing the HTML will drive you crazy and Parsing Html The Cthulhu Way)

Most of the complaints with Regex matching HTML come from the fact that HTML is loosely formed and Regex has difficulty matching different problems and user errors + some other things like recursion, etc.

However - what if HTML is actually valid XHTML (or more XML-like), that originated from a controlled environment (not general user-generated HTML document, but for example HTML-fragment templates that you would use in a client-side templating engine) and has been both manually checked for errors and validated numerous times?

Let me explain why I'm interested. I'm doing a speed benchmark of different String2DOM techniques in Javascript and I've tested everything from innerHTML, outerHTML, insertAdjacentHTML, createRange, DOMParser, doc.write (via iFrame) and even John Resig's HTMLtoDOM JS library.

And I'm curious if there is a way to go even faster.

createElement/appendChild (+setAttribute and createTextNode) is the fastest way to create DOM elements in Javascript. Regex is the fastest way to traverse large strings. Couldn't these two methods still be combined to possibly create an even faster way to parse DOMString fragments into DOM?

An example HTML string:

<div class="root fragment news">

    <div class="whitebg" data-name='Freddie Mercury'>
        <div id='myID' class="column c2">
            <h1>This is my title</h1>
            <p>Vivamus urna <em>sed urna ultricies</em> ac<br/>tempor d </p>
            <p>Mauris vel neque sit amet Quisque eget odio</p>
        </div>      

        <div class="nfo hide">Lorem <a href='http://google.com/'>ipsum</a></div>
    </div>

</div>

So ideally the code would return a documentFragment with Regex parsing the XHTML soup and using createElement/appendChild (+setAttribute/createTextNode) to fill in the elements. (a similar but not quite there yet example is HTML2DOM)
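To make the idea concrete, here is a minimal sketch of what such a function might look like. It is only an illustration under strong assumptions: the input is a well-formed, trusted XHTML fragment with quoted attributes and no comments, CDATA sections, or entities. The names `xhtmlToFragment`, `TOKEN` and `ATTR` are hypothetical, and `doc` is any document-like object (in a browser, `document`).

```javascript
// Hypothetical sketch: names and structure are illustrative only.
// Assumes well-formed, trusted XHTML: quoted attributes, no comments/CDATA/entities.

// One token per match: a closing tag, an opening (or self-closing) tag, or a text run.
const TOKEN = /<\/([\w:-]+)\s*>|<([\w:-]+)((?:\s+[\w:-]+\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*(\/?)>|([^<]+)/g;
// One attribute per match: name="value" or name='value'.
const ATTR = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;

function xhtmlToFragment(xhtml, doc) {
  const fragment = doc.createDocumentFragment();
  const stack = [fragment]; // open elements; the last entry is the current parent
  let match;
  TOKEN.lastIndex = 0;
  while ((match = TOKEN.exec(xhtml)) !== null) {
    const [, closeName, openName, attrs, selfClosing, text] = match;
    const parent = stack[stack.length - 1];
    if (closeName !== undefined) {
      // Closing tag: step back up to the enclosing element.
      if (stack.length > 1) stack.pop();
    } else if (openName !== undefined) {
      const el = doc.createElement(openName);
      let attr;
      ATTR.lastIndex = 0;
      while ((attr = ATTR.exec(attrs)) !== null) {
        // Group 2 holds a double-quoted value, group 3 a single-quoted one.
        el.setAttribute(attr[1], attr[2] !== undefined ? attr[2] : attr[3]);
      }
      parent.appendChild(el);
      // Self-closing tags such as <br/> never become a parent.
      if (selfClosing !== '/') stack.push(el);
    } else if (text !== undefined) {
      // Whitespace-only runs are kept as text nodes.
      parent.appendChild(doc.createTextNode(text));
    }
  }
  return fragment;
}
```

In a browser this would be called as `xhtmlToFragment(templateString, document)`. Whether it can compete with innerHTML is exactly what would have to be measured.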

I (and the rest of the world) am very very interested if something like that could beat the good old innerHTML in generating DOM from DOMString in JS. Could it?

Who's game to try their knowledge making something like that? And claim their place in the annals of Stackoverflow? :)

EDIT2: whoever is blindly down-voting this - at least explain what you feel is wrong with the question? I am pretty familiar with the subject, have provided the logic behind it, explained what is different about this scenario, and even posted some links to similar solutions. What about you?

To be pedantic, I'm fairly confident that it is XHTML, not xHTML. — Sean Bright, Jun 22 '12 at 18:10
I'll change it just for the sake of accuracy - but it doesn't even matter because it's basically about html fragments only (not complete documents). My example above is not even XHTML (at least not 1.1) since it has a custom HTML5 data-name attribute. The XML/XHTML part was just to stress it's about valid strict tags/templates (so that potential answers / arguments don't begin with the "HTML documents are loosely formed" blah discussion) — Michael, Jun 22 '12 at 18:19
I disagree that `createElement` is the fastest way to create DOM elements. If you have a large tree, setting `innerHTML` is much faster in current browsers (2012). — Borealid, Jun 23 '12 at 07:41
I doubt your motives. It is the same old, same old story of "I know I shouldn't but I want to anyway because \*I\* have the right reasons". You don't. I'm not sure why you think a client-run JavaScript/regex based thing could be any faster than the browser-integrated, native, highly optimized parser. Also, Regex is by no means the fastest way to parse large strings; that assertion of yours is completely unjustified. If you feel you must parse (X)HTML with regex, go ahead and learn enough about regex to do it. Asking others to do it for you, ruling out certain responses right away, is unfair. — Tomalak, Jun 23 '12 at 07:51
@Borealid actually for modern (webkit-based) browsers the opposite is true. Here is a test comparing innerHTML and createElement on exactly the same large template: http://jsperf.com/domstring ... both results are cached offscreen for even greater speed.. and both are on par speed-wise - I suspect for smaller fragments such as the one from my question, createElement would win hands down... — Michael, Jun 23 '12 at 08:12
@Tomalak you doubt my motives? :-) okay... well tell me then - what's faster than Regex to parse large strings in Javascript? I'm open to learning new things, and this is the right site to do that. Also I'm not ruling out certain responses - I've tried to explain that most negative comments made in relation to Regex-HTML-parsing here don't apply because I'm not trying to parse every loosely formed HTML site out there (but my own strict/validated templates). It seems to me that you're the one just applying the same old response to every Regex/HTML-parsing question no matter what it's about — Michael, Jun 23 '12 at 08:17
@Michael I just ran your test in FF 13, and it says innerHTML is faster. Setting that aside, though, and answering your "what's faster than Regex to parse large strings": the answer is "don't do it in Javascript". The browser has a built-in state-machine-based lexer and parser which is optimized native code. It exists for one purpose and one purpose only: parsing HTML. There's no way a JS-driven regex which you wrote for the **exact same purpose** will be as fast, although you're welcome to try. — Borealid, Jun 23 '12 at 08:19
@Tomalak also I've asked for help with Regex, a field in which I'm by no means an expert... I think your response that I should go ahead and just learn it myself is unfair - you could answer something similar for every question asked on Stack Overflow... — Michael, Jun 23 '12 at 08:19
@Borealid you could run it in a dozen different browsers and you'd get different results. You prefer Firefox, I prefer webkit-based (where it is faster) as I'm mostly interested in mobile development. Ruling it out like that is flawed - the Jsperf I've shown you shows that createElement can be faster. Parsing strings is just one part of the equation - generating DOM is another. Also there are many ways to transfer a string to DOM - innerHTML, outerHTML, insertAdjacentHTML, createRange, DOMParser, doc.write - none of them have the same speed. Testing and exploration is what it's all about. — Michael, Jun 23 '12 at 08:24
@Michael First off: What makes you think that regex is the fastest way to parse a string? Clearly you must have some grounds for this assumption. Secondly. Yes, this site is about learning. But it's not about giving bad advice. Logic (JS will never be as fast as native, no matter what you do) and years of developer experience speak against you. Just. Don't. You are wasting your time. It surely would be a nice exercise to train your regex skills, but please don't expect us to write HTML-parsing regular expressions for you. — Tomalak, Jun 23 '12 at 08:33
I'm doing benchmarking for a research paper - so by definition I'm wasting my time already. I don't believe the answer is as simple as browser HTML parse is always the fastest because over the last 10 years there have been different methods of doing that in the browser and the preferred methods changed a lot over the years. If the answer is as simple as that wouldn't innerHTML, outerHTML, insertAdjacentHTML, createRange, DOMParser, doc.write, etc - all yield the same result? (they give drastically different results in the same and in different browsers BTW) — Michael, Jun 23 '12 at 08:45
It's not as simple as that. I think you guys are mixing 2 things - one is parsing a string with Regex and the other is generating DOM. It's NOT just parsing HTML. innerHTML is something that started as IE-only but was implemented by other browsers over time. createElement and friends are official W3C core + the preferred official way to parse HTML is DOMParser, but the problem is that at the moment it only does text/xml cross-browser (aside from the latest Firefox, where it supports text/html too). So as always online - there is room and justification for hacks. @Tomalak if you don't want to help - don't — Michael, Jun 23 '12 at 08:46
You probably don't know Perl, but http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491 — BoltClock, Jun 23 '12 at 08:54
@BoltClock See tchrist's comments to the [second answer in this thread](http://stackoverflow.com/a/4234582/18771). — Tomalak, Jun 23 '12 at 09:05
@Tomalak: I hadn't noticed that this question wasn't limiting itself to short, predictable XHTML strings. — BoltClock, Jun 23 '12 at 09:08
@BoltClock thanks, I did take a look at that but it's just too different from how you'd do it in JS (plus it's parsing whole documents, which is a bit over the top for me - I just need fragments)... I'll try doing it myself over the weekend - was just hoping there was somebody willing and better versed in Regex here who'd give me a hand with the pattern-matching part of it... — Michael, Jun 23 '12 at 17:29

score 0 · Answer 1 · answered Jun 23 '12 at 08:46

First off, the answer to all performance-oriented questions is "just benchmark it". You can write the code if you want to write the code, and its performance will speak for itself.
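As a rough sketch of what "just benchmark it" could look like without a library, the helper below times each candidate function over many iterations and ranks them. The names `timeIt` and `compare` are hypothetical, the warm-up count of 50 is an arbitrary assumption, and a hand-rolled timer like this is a stand-in for, not a replacement of, tools such as benchmark.js or jsperf.

```javascript
// Hypothetical micro-benchmark helper; names and the warm-up count are illustrative.
// Uses performance.now() when available, otherwise falls back to Date.now().
const now = (typeof performance !== 'undefined' && typeof performance.now === 'function')
  ? () => performance.now()
  : () => Date.now();

// Returns the mean time per call in milliseconds.
function timeIt(fn, iterations = 1000) {
  for (let i = 0; i < 50; i++) fn(); // warm-up runs so the engine can optimize fn
  const start = now();
  for (let i = 0; i < iterations; i++) fn();
  return (now() - start) / iterations;
}

// Times every candidate and returns the results sorted fastest-first.
function compare(candidates, iterations = 1000) {
  return Object.entries(candidates)
    .map(([name, fn]) => ({ name, msPerCall: timeIt(fn, iterations) }))
    .sort((a, b) => a.msPerCall - b.msPerCall);
}
```

For a fair comparison, every candidate passed to `compare` should start from the same HTML string and end with the same DOM tree, for example one candidate assigning `innerHTML` on a detached element and another running a JS parser plus `createElement`.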

That said, I'm going to attempt to answer your question from my knowledge of web browser behavior and potentially save you some man-hours.

No, a custom Javascript-driven HTML parser could not "beat the good old innerHTML in generating DOM from DOMString in JS". It might, in theory, be able to get equally good performance, but that result is very unlikely.

The reason is that Javascript is an interpreted language. Even an ideal JS interpreter can at best optimize the JS code down to its native-equivalent sequence of browser-API calls. So, in the best case, JS code that does the equivalent of platform-native code gets identical performance: it cannot outperform its native equivalent because, under the hood, it must still make the native calls.

The task at hand here is creating a DOM tree. Here's what happens when you set the innerHTML of an element:

JS: Browser, render me some HTML! Here's a Javascript string object.

Browser: parse_html_and_create_dom_objects()

Browser: notify_javascript_of_dom_creation()

Now, here's what happens if you drive the parser with Javascript:

JS: scan_string_for_next_token()

JS: Browser, add a DOM element here!

Browser: create_dom_object()

JS: scan_string_for_next_token()

JS: Browser, add a DOM element here!

Browser: create_dom_object()

JS: Browser, append the DOM tree you created to this visible-on-screen DOM tree!

Browser: refresh_page_view_and_notify_js()

In the native version, what would be a sequence of JS calls back to the browser can all be batched together and performed in pure preoptimized C.
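To put a number on that sequence of calls, the sketch below counts how many DOM API calls a JS-driven builder would make for a given tree, versus the single assignment needed for innerHTML. The tree shape (`tag`, `attrs`, `children`) and the name `countDomCalls` are assumptions made for illustration. It counts calls only and says nothing about how long each call takes.

```javascript
// Hypothetical illustration: the tree shape and function name are assumptions.
// A node is either a string (text) or { tag, attrs: {...}, children: [...] }.
function countDomCalls(node) {
  if (typeof node === 'string') {
    return 2; // createTextNode + appendChild
  }
  const attrCalls = Object.keys(node.attrs || {}).length; // one setAttribute per attribute
  const childCalls = (node.children || [])
    .reduce((sum, child) => sum + countDomCalls(child), 0);
  // createElement + setAttribute calls + appendChild + everything in the subtree
  return 1 + attrCalls + 1 + childCalls;
}

// The "nfo hide" div from the question, written out as a plain tree.
const nfo = {
  tag: 'div',
  attrs: { class: 'nfo hide' },
  children: [
    'Lorem ',
    { tag: 'a', attrs: { href: 'http://google.com/' }, children: ['ipsum'] }
  ]
};

console.log(countDomCalls(nfo)); // 10
```

So even this two-element snippet takes ten separate calls from JS into the browser when built by hand, on top of the tokenizing work, while the innerHTML route is a single call.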

I think the reason you believe it might be faster to do the parsing in JS than in the browser internals is that you've found that, in some web browsers, calling createElement repeatedly takes less time than setting innerHTML to a chunk of markup. This is because those two calls are not performing the same amount of work. When you call createElement, you're not doing string processing (no tokenization, no lexing). When you call innerHTML = <string>, you are. So whether innerHTML is faster than a series of createElement calls depends on whether the cumulative overhead of getting the elements from JS one by one outweighs the cost of parsing the HTML string. In other words, you cheated: your benchmark is not measuring an equal amount of work, since the code that calls createElement must have known in advance which elements to create.

It is very unlikely that both parsing the HTML string and creating the elements individually from JS could be faster than doing both inside the browser. If you do manage to write JS code that outperforms the browser internals, please submit it upstream to the browser authors: web browser performance improvements help everybody, and I'm sure the developers would appreciate the irony of getting better performance from within a nested interpreter than the best they could achieve outside that interpreter.

answered Jun 23 '12 at 08:46

Borealid


I appreciate your answer Borealid - but as I've already answered in another comment thread - all I'm trying to do is follow the first 2 sentences of your answer - I do want to benchmark it. But I have little experience with Regex - that's why I asked for help. Also it's worth noting that "The Browser" is not a single entity - innerHTML in Webkit is a lot slower than in IE and Firefox, so the createElement technique might make a lot of sense in Chrome/Safari. And webkit browsers probably make up 90% of the smartphone market - where every little performance bit in webapps helps. – Michael Jun 23 '12 at 08:50
@Michael To make a JS implementation of an HTML parser, have you tried using emscripten to compile the webkit core? As to the "`createElement` technique" making sense, it doesn't - it's not applicable to the same problem domain. If you have a raw string with HTML in it, you can't just call `createElement` on it. What I'm trying to tell you is that something has to parse the string. The reason `createElement` can be faster is that it *doesn't* parse the HTML. Putting the parser in JS will not be as fast as doing parse+append - otherwise known as setting `innerHTML`. – Borealid Jun 23 '12 at 08:52
Trying one last time to get the idea across: `time(set-innerHTML) < time(js-parse-HTML-to-dom) + time(createElement)`. I guarantee it. `time(set-innerHTML) = time(createElement-internal) + time(parse-HTML-internal)`. `time(parse-HTML-internal) < time(js-parse-HTML-to-dom)`. – Borealid Jun 23 '12 at 08:57
Parsing strings in Javascript is an operation that on modern PCs runs to the tune of millions of operations per second. DOM alterations are (generally speaking) "just" in the thousands-of-operations-per-second range. To me - it's worth it to test whether combining them would not make as much of an impact as you seem to believe it would across every possible browser (because again - parsing a string is MUCH faster than adding to DOM). And again innerHTML in webkit does not behave as it does in IE/Firefox - so there might be sense in doing it via createElement there. – Michael Jun 23 '12 at 09:00
@Michael You keep missing the point. "Parsing" strings is such a broad term that you can't make such a general statement in the first place. I can write regex that "parses" a string on the order of one operation per second. You're making unfounded assumptions and basing an entire theory on them. Dissecting a string with regex and building a DOM from the parts with the DOM API will be slower than passing a string to an HTML parser. No matter how you put it. It is a matter of very simple, straightforward logic. Let go of the notion that regex is cheap, close to a no-op. It isn't. – Tomalak Jun 23 '12 at 09:16
@Tomalak Ah I'm very sorry but it's you who keeps missing the point. The first few of your answers I kept rebutting or posting Jsperf benchmarks to - but it's slowly getting ridiculous because you keep generalizing and speaking about "HTML parser" and these "browsers". Which HTML parser and which browser are you referring to? There is no HTML parser - only different methods (that BTW operate at very different speeds) of accessing it. The closest thing to this "HTML parser" you keep referring to is DOMParser(), which for text/html is supported almost nowhere. I just want to tune how I access it – Michael Jun 23 '12 at 17:13
In Webkit methods like innerHTML are a lot slower than native w3c methods like createElement (and friends). Yes it's easily possible to write a string parser that would take seconds or minutes - but I gave an example of my string above and I seriously doubt it would take even remotely that long. So since string operations are generally a loooot cheaper than DOM operations - I believe it would be worth it to try something like I propose. I base every word I say on my own tests performed using benchmark.js or online at jsperf - what theory are you basing your assumptions on? – Michael Jun 23 '12 at 17:13
so again to me testing something like this is very much worth it (in webkit-browsers at least): _very-fast-time (parse string) + fast-time(createElement) < medium-time (innerHTML)_ ... because it's founded on a solid proven basis - innerHTML is slower and string operations are cheaper than DOM operations... I'm not writing my own DOM parser - I'm just tweaking how you access the one in the browser... heck innerHTML is not even part of W3C specs (until HTML5) and has different results across the browsers - up until 3 years ago everybody was using hidden iframes because it was the best thing ever – Michael Jun 23 '12 at 17:31
@Michael You came here for answers. You got exactly one answer, from multiple sources. You might notice that Tomalak has over 95,000 reputation, especially focused in the "javascript" and "html" tags, and I have 20+k myself. You then proceeded to argue and ignore the answers. I'm telling you: "parse string" is not "very-fast-time". You will find that the time taken to parse the string is always greater than the difference between the performance of `innerHTML` and `createElement`, in any extant web browser. – Borealid Jun 23 '12 at 18:15
@Borealid ah so sooner or later it comes to who's bigger eh? I got a repetition of the exact same regex mantra from exactly two people. My reputation has actually grown since more people seem interested in the answer. Reputation is relative - I don't spend as much time on the site but have a lot of specific JS experience. Also you call it arguing but I'm the only one posting tests and reasoning behind my logic. You just keep repeating opinions. +You don't really read my comments - I always say _string operations are cheaper than DOM operations_ and you always hear _string operations are cheap_ – Michael Jun 23 '12 at 19:13
