Work-around a StackOverflowException

Question

I'm using HtmlAgilityPack to parse roughly 200,000 HTML documents.

I cannot predict the contents of these documents; however, one such document causes my application to fail with a StackOverflowException. The document contains this HTML:

<ol>
    <li><li><li><li><li><li>...
</ol>

There are roughly 10,000 <li> elements nested like that. Because of the way HtmlAgilityPack parses HTML, this nesting causes a StackOverflowException.

Unfortunately a StackOverflowException is not catchable in .NET 2.0 and later.

I considered setting a larger stack size for the thread, but that feels like a hack: it would make my program use a lot more memory (it starts about 50 threads for processing HTML, and every one of them would get the larger stack), and it would need manual adjustment if a similar document ever turned up again.

Are there any other workarounds I could employ?

Not really. Unless you feel like switching to a different library, I don't see a better way than increasing stack size. Perhaps there is a way to set the stack size only for the thread(s) that need(s) it? — Matt Ball, Oct 01 '12 at 00:24

sjdirect · Answer 1 · 2013-03-08T21:07:48.710

I just patched an error that I believe is the same as the one you're describing. I uploaded the patch to the HAP project site...

http://www.codeplex.com/site/users/view/sjdirect (see the patch on 3/8/2012)

Or see more documentation of the issue and result here....

https://code.google.com/p/abot/issues/detail?id=77

The actual fix was... Added HtmlDocument.OptionMaxNestedChildNodes that can be set to prevent StackOverflowExceptions that are caused by tons of nested tags. It will throw an ApplicationException with message "Document has more than X nested tags. This is likely due to the page not closing tags properly."

How I'm Using Hap After Patch...

HtmlDocument hapDoc = new HtmlDocument();
hapDoc.OptionMaxNestedChildNodes = 5000; // This is what was added
string rawContent = GETTHECONTENTHERE; // placeholder: supply the raw HTML here
try
{
    hapDoc.LoadHtml(rawContent);
}
catch (Exception e)
{
    // Instead of a StackOverflowException you should end up here now
    hapDoc.LoadHtml(""); // fall back to an empty document
    _logger.Error(e);
}

Dai · Accepted Answer · 2022-01-18T00:31:39.660

Ideally, the long-term solution is to patch HtmlAgilityPack to use an explicit heap-allocated stack instead of the call stack, but that would be too big an undertaking for me. I've temporarily lost my CodePlex account details, but when I get them back I'll submit an issue report on the problem. I also note that this issue could be a Denial-of-Service vulnerability for any site that uses HtmlAgilityPack to sanitize user-submitted HTML - a crafted, overly nested HTML document would cause the w3wp.exe process to die.
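To illustrate what trading the call stack for a heap-allocated stack looks like, here is a minimal sketch. The Node type and the IterativeWalk helper are hypothetical stand-ins invented for this example; they are not HtmlAgilityPack types, and this is not how the library is actually implemented. The sketch only shows that a traversal driven by a Stack<T> does not add a call-stack frame per level of nesting.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal tree node; NOT HtmlAgilityPack's HtmlNode.
public sealed class Node
{
    public List<Node> Children { get; } = new List<Node>();
}

public static class IterativeWalk
{
    // Returns the depth of the deepest node (root = 1) without recursion.
    // Pending nodes live in a heap-allocated Stack<T>, so nesting depth is
    // limited by available memory rather than by the thread's stack size.
    public static int MaxDepth(Node root)
    {
        if (root == null) return 0;

        int max = 0;
        var pending = new Stack<KeyValuePair<Node, int>>();
        pending.Push(new KeyValuePair<Node, int>(root, 1));

        while (pending.Count > 0)
        {
            KeyValuePair<Node, int> entry = pending.Pop();
            if (entry.Value > max) max = entry.Value;

            foreach (Node child in entry.Key.Children)
                pending.Push(new KeyValuePair<Node, int>(child, entry.Value + 1));
        }
        return max;
    }

    // Builds a chain of `depth` nodes, each the only child of the previous
    // one, mimicking the <li><li><li>... nesting from the question.
    public static Node BuildChain(int depth)
    {
        var root = new Node();
        Node current = root;
        for (int i = 1; i < depth; i++)
        {
            var child = new Node();
            current.Children.Add(child);
            current = child;
        }
        return root;
    }
}
```

Calling IterativeWalk.MaxDepth(IterativeWalk.BuildChain(100000)) walks a chain ten times deeper than the problem document using only a loop. A real patch would have to apply the same idea to every recursive routine in the parser, which is why it is a large job.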

In the meantime, I figured the best way forward is to manually override the maximum thread stack size. I was wrong in my earlier statement that a bigger stack-size means that all threads automatically consume that memory (it seems memory pages are allocated for a thread stack as it grows, not all-at-once).

I made a copy of the <ol><li> page and ran some experiments. I found that my program failed when the stack size was less than 2^21 bytes (2MB) in size, but a maximum size of 2^22 bytes (4MB) succeeded - and 4MB in my book passes as an "acceptable" hack... for now.
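For reference, one way to set the maximum stack size is the Thread(ThreadStart, Int32) constructor in System.Threading. The sketch below wraps it in a small helper; LargeStackRunner and its Run method are hypothetical names made up for this example, and the 4 MB figure is simply the value from the experiment above, not a general recommendation.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of .NET or HtmlAgilityPack.
public static class LargeStackRunner
{
    // Runs `work` on a dedicated thread whose maximum stack size is
    // `maxStackSizeBytes`, blocks until it finishes, and returns its result.
    public static T Run<T>(Func<T> work, int maxStackSizeBytes)
    {
        T result = default(T);
        Exception error = null;

        var thread = new Thread(() =>
        {
            // Ordinary exceptions are captured and rethrown on the caller's
            // thread. A StackOverflowException still cannot be caught here.
            try { result = work(); }
            catch (Exception ex) { error = ex; }
        }, maxStackSizeBytes);

        thread.Start();
        thread.Join();

        if (error != null)
            throw new InvalidOperationException("Work failed on the large-stack thread.", error);
        return result;
    }
}
```

In the scenario from the question, the delegate passed to Run would wrap the HtmlAgilityPack parsing call, for example LargeStackRunner.Run(() => { doc.LoadHtml(html); return doc; }, 4 * 1024 * 1024), where doc and html are assumed to be the caller's own HtmlDocument and input string.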

score -1 · Answer 3 · answered Jan 11 '22 at 10:31

This should work:

HtmlDocument.MaxDepthLevel = 10000;
var doc = new HtmlDocument();
try
{
    doc.LoadHtml(document);
}
catch(Exception ex)
{
    Console.WriteLine("Exception while loading html: " + ex);
    yield break;
}

setail

Your answer could be improved by adding more information on what the code does and how it helps the OP. – Tyler2P Jan 11 '22 at 12:50
You can’t catch a `StackOverflowException` since the 2.0 CLR. – Dai Jan 11 '22 at 18:32
