I want to allow embedding of HTML but avoid DoS due to deeply nested HTML documents that crash some browsers. I'd like to be able to accommodate 99.9% of documents, but reject those that nest too deeply.
Two closely related questions:
- What document depth limits are built into browsers? E.g. browser X fails to parse or does not build documents with depth > some limit.
- Are document depth statistics for documents on the web available? Is there a site with web statistics showing that some percentage of real documents on the web have a document depth below some value?
Document depth is defined as 1 + the maximum number of parent traversals needed to reach the document root from any node in a document. For example, in
<html> <!-- 1 -->
<body> <!-- 2 -->
<div> <!-- 3 -->
<table> <!-- 4 -->
<tbody> <!-- 5 -->
<tr> <!-- 6 -->
<td> <!-- 7 -->
Foo <!-- 8 -->
the maximum depth is 8, since the text node "Foo" has 8 ancestors. Ancestor here is interpreted non-strictly, i.e. every node is its own ancestor and its own descendant.
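To make the definition concrete, here is a minimal sketch in Python that measures depth by tracking open tags with the standard library's `html.parser`. The names `DepthCounter` and `document_depth` are made up for illustration. It counts tags as written, so it does not insert implied elements or perform the error recovery a browser's tree builder would, and it ignores comments and whitespace-only text. Treat the result as an approximation of the DOM depth, not the value a browser would compute.

```python
from html.parser import HTMLParser

# Void elements add a node but never stay open, so they cannot have children.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
})


class DepthCounter(HTMLParser):
    """Hypothetical helper: records the deepest node seen while parsing."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_elements = 0  # number of currently open (non-void) elements
        self.max_depth = 0      # deepest node seen so far

    def _saw_node_at(self, depth):
        self.max_depth = max(self.max_depth, depth)

    def handle_starttag(self, tag, attrs):
        # The new element sits one level below the currently open elements.
        self._saw_node_at(self.open_elements + 1)
        if tag not in VOID_ELEMENTS:
            self.open_elements += 1

    def handle_endtag(self, tag):
        # Simplification: any end tag closes one level; a real tree builder
        # matches end tags against the stack of open elements.
        if tag not in VOID_ELEMENTS and self.open_elements > 0:
            self.open_elements -= 1

    def handle_data(self, data):
        # A non-blank text node is a child of the innermost open element.
        if data.strip():
            self._saw_node_at(self.open_elements + 1)


def document_depth(html):
    counter = DepthCounter()
    counter.feed(html)
    counter.close()  # flushes any text buffered at the end of the input
    return counter.max_depth
```

Applied to the example above, `document_depth("<html><body><div><table><tbody><tr><td>Foo")` counts seven open elements plus the text node.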
Opera has some table nesting stats, which suggest that 99.99% of documents have a table nesting depth of less than 22, but that data does not contain whole document depth.
EDIT:
If people would like to criticize the HTML sanitization library instead of answering this question, please do. http://code.google.com/p/owasp-java-html-sanitizer/wiki/AttackReviewGroundRules explains how to find the code, where to find a testbed that lets you try out attacks, and how to report issues.
EDIT:
I asked Adam Barth, and he very kindly pointed me to the WebKit code that handles this.
WebKit, at least, enforces a depth limit. When a tree builder is created, it receives a configurable maximum tree depth:
m_treeBuilder(HTMLTreeBuilder::create(this, document, reportErrors, usePreHTML5ParserQuirks(document), maximumDOMTreeDepth(document)))
and it is tested by the block-nesting-cap test.