What is the maximum depth of HTML documents in practice?

Question

I want to allow embedding of HTML but avoid DoS due to deeply nested HTML documents that crash some browsers. I'd like to be able to accommodate 99.9% of documents, but reject those that nest too deeply.

Two closely related questions:

What document depth limits are built into browsers? E.g. browser X fails to parse or does not build documents with depth > some limit.
Are document depth statistics for documents available on the web? Is there a site with web statistics that explains that some percentage of real documents on the web have document depths less than some value.

Document depth is defined as 1 + the maximum number of parent traversals needed to reach the document root from any node in a document. For example, in

<html>                   <!-- 1 -->
  <body>                 <!-- 2 -->
    <div>                <!-- 3 -->
      <table>            <!-- 4 -->
        <tbody>          <!-- 5 -->
          <tr>           <!-- 6 -->
            <td>         <!-- 7 -->
              Foo        <!-- 8 -->

the maximum depth is 8 since the text node "Foo" has 8 ancestors. Ancestor here is interpreted non-strictly, i.e. every node is its own ancestor and its own descendant.
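To make the definition concrete, here is a minimal sketch that measures depth on a plain object tree. The node shape (`name`, `children`) and the function name `documentDepth` are assumptions made for illustration; they are not part of any browser or sanitizer API.

```javascript
// Hypothetical node shape: { name: string, children: Node[] }.
// Text nodes are leaves with an empty children array.
function documentDepth(root) {
  let maxDepth = 0;
  // Explicit stack instead of recursion, so that a very deep tree
  // cannot overflow the call stack of the measuring code itself.
  const stack = [{ node: root, depth: 1 }];
  while (stack.length > 0) {
    const { node, depth } = stack.pop();
    if (depth > maxDepth) maxDepth = depth;
    for (const child of node.children) {
      stack.push({ node: child, depth: depth + 1 });
    }
  }
  return maxDepth;
}

// The example document above, built innermost-first.
const names = ['html', 'body', 'div', 'table', 'tbody', 'tr', 'td'];
let tree = { name: '#text', children: [] };
for (let i = names.length - 1; i >= 0; i--) {
  tree = { name: names[i], children: [tree] };
}
console.log(documentDepth(tree)); // 8
```

The iterative traversal is a deliberate choice: a recursive version would itself be exposed to the deep-nesting problem it is trying to measure.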

Opera has some table nesting stats, which suggest that 99.99% of documents have a table nesting depth of less than 22, but that data does not contain whole document depth.
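If per-document depths were available from a crawl, the cutoff I am after could be computed roughly as follows. This is only a sketch: `depthCutoff` is a hypothetical helper, and the sample array is made up for illustration, not measured data.

```javascript
// Smallest k such that at least `fraction` of the sampled documents
// have depth <= k. `depths` holds one maximum depth per document.
function depthCutoff(depths, fraction) {
  if (depths.length === 0) throw new Error('no samples');
  const sorted = [...depths].sort((a, b) => a - b);
  // Number of documents that must fall at or below the cutoff.
  const needed = Math.ceil(fraction * sorted.length);
  return sorted[Math.max(needed, 1) - 1];
}

// Made-up sample, for illustration only (not measured data).
const sampleDepths = [8, 12, 9, 15, 11, 10, 40, 13, 9, 14];
console.log(depthCutoff(sampleDepths, 0.9)); // 15
```

With a real sample, `fraction` would be 0.999 or 0.9999, and the sample would need to be large enough for that tail to be meaningful.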

EDIT:

If people would like to criticize the HTML sanitization library instead of answering this question, please do. http://code.google.com/p/owasp-java-html-sanitizer/wiki/AttackReviewGroundRules explains how to find the code, where to find a testbed that lets you try out attacks, and how to report issues.

EDIT:

I asked Adam Barth, and he very kindly pointed me to webkit code that handles this.

WebKit, at least, enforces a depth limit. When the tree builder is created, it receives a maximum tree depth that is configurable:

m_treeBuilder(HTMLTreeBuilder::create(this, document, reportErrors, usePreHTML5ParserQuirks(document), maximumDOMTreeDepth(document)))

and it is tested by the block-nesting-cap test.

I'm curious, where did you get the idea that there *is* a nesting limit, or "deeply nested HTML documents that crash some browsers"? I've never heard of that. — Wesley Murch, Oct 14 '11 at 16:35
@Wesley Murch, empirically. [An issue that came out of an attack review](http://code.google.com/p/owasp-java-html-sanitizer/issues/detail?id=3) says "OK, I didn't circumvent the protection, but I managed to crash Firefox 8 and make it unusable until I restarted it in safe mode. My input was about 20000× `<div>` (opening only, no closing)". – Mike Samuel, Oct 14 '11 at 16:36
Is this the only source? Are you sure the crash was due to deeply nested tags and not something else? Was there a control test? Are you sure this was not directly related to the HTML Sanitizer only? — Wesley Murch, Oct 14 '11 at 16:41
I think nesting of html is not really your most pressing concern. There are a _lot_ of evil things that users can do with HTML. http://www.codinghorror.com/blog/2008/10/programming-is-hard-lets-go-shopping.html — Nick ODell, Oct 14 '11 at 16:42
@WesleyMurch, the output from the HTML sanitizer when given 20k open div tags is `("<div>" x 20000) + ("</div>" x 20000)`. This reaches the browser just fine and the browser then crashes. I have ruled out nesting depths < 20k and have ruled out the HTML sanitizer itself, as a static document reproduces the problem. In the case of FF8 it only manifests with a certain plugin, but in the case of a particular version of Chrome, it seems to be the browser renderer itself. – Mike Samuel, Oct 14 '11 at 16:51
@NickODell, I am aware that there are lots of evil things that users can do with HTML. This is the most pressing concern right now as it is the only remaining unresolved issue to come out of the first round of attack review. — Mike Samuel, Oct 14 '11 at 16:52
@WesleyMurch, if this is an XY Problem, what question should I be asking? — Mike Samuel, Oct 14 '11 at 16:54
@NickODell, Thanks for the link. This implementation is not vulnerable to the problems outlined in that post -- it does not use regular expressions or any other pattern based filters. It tokenizes HTML, applies tag and element whitelists, and then uses a normalizing renderer to produce a syntactically valid result. — Mike Samuel, Oct 14 '11 at 16:58
Good point, deleted my comment about "XY problem" - surely you know about attack vectors other than "document depth" (apparently). Not sure if you tested with proper closing tags or not, but in any case you might want to look at HTML tidy or something to ensure the HTML is even valid, then clean it up or reject it. Not sure what languages you are open to using to accomplish your task, might want to add that. +1 for interesting question, regardless. — Wesley Murch, Oct 14 '11 at 16:59
@WesleyMurch, NickODell I edited the OP with a link to the attack review ground rules in case you're interested. I would love to get more criticism of the sanitizer itself. — Mike Samuel, Oct 14 '11 at 17:02
@WesleyMurch, the sanitizer library is implemented in Java, but that is incidental to this question which is why I did not tag it as such. I'm glad that you're skeptical by default -- I would have a harder time learning about novel attack vectors if people weren't. — Mike Samuel, Oct 14 '11 at 17:04
Once the limit is reached, I think some browsers will flatten deeply-nested nodes into adjacent child nodes of the parent. I don't remember where I read that though. It might have been a webkit bug where I found out about it. — Shadow2531, Oct 14 '11 at 17:08
@Shadow2531, thanks. I don't see anything about folding on exceeding depth limits in http://www.w3.org/TR/html5/tree-construction.html#insert-an-html-element but that doesn't mean that browsers don't agree on how they deal with deeply nested content. — Mike Samuel, Oct 14 '11 at 17:54
I don't think there is an actual limit. If there was a built in limit, the browser wouldn't crash. Even so, you can easily test it. Just create a document with nested elements of various depth. You can make a PHP script that does that. If you start with 10, you can double the depth on each try. If the browser crashes you know the limit is somewhere between LastOkDepth and LastOkDepth * 2. You can then start halfway inbetween and cut the range in half each time. That way, you can find each browser's limits in maybe a couple of dozen tries. — GolezTrol, Oct 14 '11 at 18:45
@GolezTrol, I am not asking about an actual limit for all browsers. If people know the limit for a particular browser, that's great. I am also asking about the distribution of maximum nesting depths in real HTML documents. If I were to crawl the web, what is the k such that 99.99% of documents have a nesting depth <= k. — Mike Samuel, Oct 14 '11 at 20:18
@Pekka, depth is a problem if http://code.google.com/p/owasp-java-html-sanitizer/issues/detail?id=3 is accurate. Most browsers work just fine on 214kB HTML files, but some fail when the entire file is `<div>...</div>` nested 20000 deep. – Mike Samuel, Oct 14 '11 at 21:32
@Mike I'm not sure whether depth really is the problem there - it seems it's more the number of invalid elements, isn't it? Would it still break if it were a proper 20000 deep HTML structure? I still think you can fix this by setting a limit on either a sane number of *tags*, or an arbitrary sane nesting depth (like, say, 1000)... But then, I haven't had to deal with this in real life so my view may be too simplistic. — Pekka, Oct 14 '11 at 21:36
If you are concerned about this, what the browser can do is the least of your worries. You're doing it all wrong. — Rob, Oct 14 '11 at 21:53
@Shadow2531, I found the folding code in webkit. [Line 100 of HTMLConstructionSite.cpp](http://trac.webkit.org/browser/trunk/Source/WebCore/html/parser/HTMLConstructionSite.cpp#L100) says `if (m_openElements.stackDepth() > m_maximumDOMTreeDepth)parent = parent->parentNode();` — Mike Samuel, Oct 18 '11 at 22:25
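
The doubling-then-bisecting search that GolezTrol describes in the comments could be sketched like this. `renders` is a hypothetical predicate standing in for "load a document nested this deep and see whether the browser survives", and the search assumes failures are monotonic in depth, which may not hold for every browser.

```javascript
// `renders(depth)` is a hypothetical predicate: true when a document nested
// `depth` deep renders, false when the browser fails. In a real run it would
// be a manual or automated check; it is assumed to be monotonic.
function findMaxWorkingDepth(renders, start = 10, ceiling = 1 << 24) {
  if (!renders(start)) return 0; // even the starting depth fails
  let ok = start;
  let bad = start * 2;
  // Phase 1: double until a failure (or the safety ceiling) is hit.
  while (bad <= ceiling && renders(bad)) {
    ok = bad;
    bad *= 2;
  }
  if (bad > ceiling) return ok; // no failure found below the ceiling
  // Phase 2: bisect between the last success and the first failure.
  while (bad - ok > 1) {
    const mid = ok + Math.floor((bad - ok) / 2);
    if (renders(mid)) ok = mid; else bad = mid;
  }
  return ok;
}

// Simulated browser that fails above a made-up limit of 3218.
console.log(findMaxWorkingDepth((d) => d <= 3218)); // 3218
```

Against the simulated limit this needs about two dozen probes, in line with the "couple of dozen tries" estimate.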

score 21 · Accepted Answer · edited Oct 31 '17 at 16:17

It may be worth asking coderesearch@google.com. Their study from 2005 (http://code.google.com/webstats/) doesn't cover your particular question. They sampled more than a billion documents though, and are interested in hearing about anything you feel is worth examining.

--[Update]--

Here's a crude script I wrote to test the browsers I have (putting the number of elements to nest into the query string):

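// Number of elements to nest, taken from the query string (e.g. test.html?3000).
var n = Number(window.location.search.substring(1));

var outboundHtml = ''; // opening tags, each followed by its depth
var inboundHtml = '';  // matching closing tags

for(var i = 0; i < n; i++)
{
    outboundHtml += '<div>' + (i + 1);
    inboundHtml += '</div>';
}

// Render the nested markup in a new window.
var testWindow = window.open();
testWindow.document.open();
testWindow.document.write(outboundHtml + inboundHtml);
testWindow.document.close();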

And here are my findings (they may be specific to my machine: Windows XP, 3 GB RAM):

Chrome 9: 3218 nested elements will render, 3219 crashes the tab. (Chrome 9 is old, I know; the updater fails on my corporate LAN.)
Safari 5: 3477 will render; at 3478 the browser closes completely.
IE8: 1000000+ will render (memory permitting), although performance degrades significantly once the count reaches the high four figures, due to event bubbling when scrolling, moving the mouse, etc. Anything over 10000 appears to lock up, but I think it is just taking a very long time, which is effectively a DoS.
Opera 11: Limited only by memory as far as I can tell, i.e. my script runs out of memory at 10000000. For large documents that do render, though, there doesn't seem to be any performance degradation like in IE.
Firefox 3.6: ~1500000 will render, but testing above this range resulted in the browser crashing with Mozilla Crash Reporter or just hanging. Sometimes a number that had worked would fail on a subsequent run, and larger numbers (~1700000) would crash Firefox straight from a restart.

More on Chrome:

Changing the DIV to a SPAN resulted in Chrome being able to nest 9202 elements before crashing. So it's not the size of the HTML that is the reason (although SPAN elements may be more lightweight).

Nesting 2077 table cells (<table><tr><td>, i.e. 6231 elements) rendered, but Chrome crashed as soon as I scrolled down to cell 445, so in practice you can't nest even 445 table cells (1335 elements).

Testing with files generated from the script (as opposed to writing to new windows) gives slightly higher tolerances, but Chrome still crashed.

You can nest 1409 list items (<ul><li>) before it crashes, which is interesting because:

Firefox stops indenting list items after 99, maybe a programmatic constraint.
Opera keeps indenting, with glitches at 250, 376, 502, 628, 754, 880...

Setting a DOCTYPE makes a difference in IE8 (it puts the browser into standards mode, e.g. var outboundHtml = '<!DOCTYPE html>';): it will not nest 792 list items (the tab crashes/closes) or 1593 DIVs. It made no difference in IE8 whether the test was generated from the script or loaded from a file.
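
The variations above (DIV, SPAN, table cells, list items, with or without a DOCTYPE) can all be produced by one small generator. This is a sketch rather than the exact script I ran; `nestedMarkup` and its parameters are names chosen here for illustration.

```javascript
// Build `n` nested repetitions of a tag sequence, e.g. ['table', 'tr', 'td'].
// Each repetition is labelled with its index so the deepest rendered one is visible.
function nestedMarkup(tags, n, doctype = '') {
  const open = tags.map((t) => '<' + t + '>').join('');
  const close = tags.slice().reverse().map((t) => '</' + t + '>').join('');
  const opening = [];
  for (let i = 1; i <= n; i++) opening.push(open + i);
  return doctype + opening.join('') + close.repeat(n);
}

console.log(nestedMarkup(['ul', 'li'], 2));
// <ul><li>1<ul><li>2</li></ul></li></ul>
console.log(nestedMarkup(['div'], 1, '<!DOCTYPE html>'));
// <!DOCTYPE html><div>1</div>
```

The returned string can be written to a file or passed to `document.write` in a test window, depending on which loading path is being tested.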

So a browser's nesting limit apparently depends on the type of HTML element the attacker injects and on the layout engine. Some other markup might trigger a crash with a considerably smaller payload than these. Either way, we have a plain-HTML DoS against IE8, Chrome and Safari users with a fairly small payload.

It seems that if you allow users to post HTML that gets rendered on one of your pages, and your size limit is generous, it is worth limiting the nesting of elements as well.
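
A nesting limit of that kind could be checked on a token stream before the markup is emitted. The sketch below assumes a hypothetical token shape and assumes the tokens are already balanced (as a normalizing sanitizer would produce); it is not the API of any particular sanitizer, and the limit passed in is an arbitrary choice.

```javascript
// Hypothetical token shape: { type: 'open' | 'close' | 'text', name?: string }.
// Assumes the tokens are already balanced, as a normalizing sanitizer would emit.
function exceedsDepth(tokens, maxDepth) {
  let depth = 0;
  for (const token of tokens) {
    if (token.type === 'open') {
      depth++;
      if (depth > maxDepth) return true; // stop early; no need to scan the rest
    } else if (token.type === 'close') {
      depth--;
    } else if (depth + 1 > maxDepth) {
      return true; // a text node counts one level below its parent element
    }
  }
  return false;
}

const tokens = [
  { type: 'open', name: 'div' },
  { type: 'open', name: 'div' },
  { type: 'text' },
  { type: 'close', name: 'div' },
  { type: 'close', name: 'div' },
];
console.log(exceedsDepth(tokens, 3)); // false
console.log(exceedsDepth(tokens, 2)); // true
```

Void elements such as `<br>` are not modelled here; a real check would have to treat them as leaves rather than as open tags.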

Thanks. I didn't get stats, but I got pointers into webkit code which enforces this. I edited the OP with the pointers. — Mike Samuel, Oct 18 '11 at 22:21
WRT. Firefox, I've run into this lovely little bug myself: https://bugzilla.mozilla.org/show_bug.cgi?id=256180 As a result, any elements past 200 in depth simply are not rendered. You can test this by a simple script that creates a string of over-200 depth (I used 500 for argument's sake), which contains a known string, then testing if the known string appears anywhere when you render it. — Gert Sønderby, Aug 20 '15 at 13:00

score 4 · Answer 2 · answered Oct 18 '11 at 22:30

For WebKit, the maximum document depth is configurable, but by default it is 512:

http://trac.webkit.org/browser/trunk/Source/WebCore/page/Settings.h#L408

static const unsigned defaultMaximumHTMLParserDOMTreeDepth = 512;
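
The effect of that cap, combined with the folding line quoted in the comments on the question (`parent = parent->parentNode()` once the open-element stack is deeper than the limit), can be modelled roughly as follows. This is a simplified illustration with a hypothetical token and node shape, not WebKit's code; the null check on `parent.parent` is an addition for this model.

```javascript
// Hypothetical token shape: { type: 'open' | 'close' | 'text', name?: string }.
function buildCappedTree(tokens, maxDepth) {
  const root = { name: '#root', parent: null, children: [] };
  const openElements = [root]; // root at index 0, open elements after it
  for (const token of tokens) {
    if (token.type === 'close') {
      if (openElements.length > 1) openElements.pop();
      continue;
    }
    let parent = openElements[openElements.length - 1];
    // Mirrors the quoted check: once the open-element stack is deeper than
    // the cap, attach to the parent's parent so the tree stops getting deeper.
    if (openElements.length - 1 > maxDepth && parent.parent) {
      parent = parent.parent;
    }
    const node = { name: token.name || '#text', parent, children: [] };
    parent.children.push(node);
    if (token.type === 'open') openElements.push(node);
  }
  return root;
}

// Depth of the deepest node below `node`, measured iteratively.
function elementDepth(node) {
  let deepest = 0;
  const stack = [{ node, depth: 0 }];
  while (stack.length > 0) {
    const { node: current, depth } = stack.pop();
    deepest = Math.max(deepest, depth);
    for (const child of current.children) stack.push({ node: child, depth: depth + 1 });
  }
  return deepest;
}

// 20000 unclosed <div> tokens, as in the attack-review report.
const divs = Array.from({ length: 20000 }, () => ({ type: 'open', name: 'div' }));
console.log(elementDepth(buildCappedTree(divs, 512))); // 513
```

In this model the stack keeps growing while the tree does not: every element past the cap becomes a sibling at the deepest allowed level, which is one more than the cap because the check uses a strict `>`.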

Mike Samuel

Fantastic! But does the browser crash? – Lee Kowalkowski Oct 19 '11 at 08:46
@LeeKowalkowski, WebCore shouldn't. It folds children of nodes past this limit into the parent rather than increase the stack as at http://trac.webkit.org/browser/trunk/Source/WebCore/html/parser/HTMLConstructionSite.cpp#L100 but other browsers do crash. – Mike Samuel Oct 19 '11 at 10:54
I've managed to crash Chrome, Safari and IE8 quite easily, Firefox and Opera seem to just run out of memory really (not obvious whether it's my script or the document). I've included my findings in my answer. – Lee Kowalkowski Oct 21 '11 at 10:17
