0

I have a .NET Windows service that takes HTML content and generates Word 2007 files from it. Before conversion, the HTML is cleaned up (empty tags are removed, etc.) by a recursive function. However, some large HTML documents cause an "out of memory" exception because of the recursive function. I added a retry counter so that the function is not called more than a set number of times, but that resulted in many HTML files either not being converted or being converted to bad Word 2007 content.

If I try to split the HTML source and process it in pieces, it might complicate things: each HTML structure is different, and splitting the content would probably mean changing the cleanup code.

Need some suggestions on how to handle this problem.

Any help would be very much appreciated.

Ashish Gupta
  • 1
    Can you use F#, and have your C# code call that, so you can use tail recursion (http://stackoverflow.com/questions/33923/what-is-tail-recursion)? – James Black May 21 '11 at 21:48
  • @James, very interesting. I never knew about tail recursion. However, I am not sure if I can make that change at this point. Thank you for the suggestion. – Ashish Gupta May 21 '11 at 22:20

2 Answers

1

Don't use recursion. Try the HTML Agility Pack.

It's an HTML parser that is commonly recommended for this. It will take malformed HTML and massage it into XHTML and then a traversable DOM, like the XML classes.
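
As a rough illustration of the "don't use recursion" advice (this is not code from the answer, and it uses Python's standard `xml.etree.ElementTree` rather than the HTML Agility Pack, so it assumes well-formed input), the sketch below removes empty elements with an explicit stack instead of recursive calls. The names `VOID_TAGS`, `is_empty` and `remove_empty_elements` are made up for the example, and "empty" is assumed to mean no children and no non-whitespace text.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of tags that are allowed to be empty.
VOID_TAGS = {"br", "hr", "img", "input"}


def is_empty(elem):
    """True if elem has no children, no visible text, and is not a void tag."""
    return (
        elem.tag not in VOID_TAGS
        and len(elem) == 0
        and not (elem.text or "").strip()
    )


def detach(parent, child):
    """Remove child from parent, keeping any text that followed the child."""
    tail = child.tail or ""
    index = list(parent).index(child)
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def remove_empty_elements(root):
    """Remove empty descendants of root without any recursive calls."""
    # Pass 1: collect (parent, child) pairs using an explicit stack.
    pairs = []
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in parent:
            pairs.append((parent, child))
            stack.append(child)
    # Pass 2: walk the pairs backwards so that every element is checked
    # only after all of its descendants have been checked.
    for parent, child in reversed(pairs):
        if is_empty(child):
            detach(parent, child)
    return root


root = ET.fromstring("<div><p>keep</p><p></p><div><span> </span></div><br/></div>")
remove_empty_elements(root)
print([e.tag for e in root.iter()])  # ['div', 'p', 'br']
```

Because the traversal state lives in ordinary lists on the heap, the nesting depth of the document no longer affects the call stack. The same pattern should carry over to C# with a `Stack<T>` over whatever node type the parser exposes.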

Piotr Perak
  • Peri- I have looked into that. The legacy recursive code also converts things like "

    Content here and here

    Content here and here

    " to "

    Content here and here

    ". Not sure if HTML Agility Pack can do something like that.
    – Ashish Gupta May 21 '11 at 22:29
  • I don't know it that well. But if it gives you an object model of the document, you can probably look for unclosed tags. But wait a second. I read OutOfMemoryException in your question and thought StackOverflowException. I have seen a StackOverflowException when recursion had no stopping condition, but never an OutOfMemoryException. So maybe recursion isn't the problem. – Piotr Perak May 22 '11 at 07:38
0

You could try wrapping the outermost call to the recursive function with a try...catch statement to catch OutOfMemoryException. That would at least let you continue with the next file.
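
A minimal sketch of that idea, written in Python only for illustration (the service itself is .NET): each file is converted inside its own try block, and failures are recorded so the batch carries on. `convert_all` and the `convert` callback are hypothetical names, and `MemoryError` stands in for .NET's `OutOfMemoryException`.

```python
def convert_all(paths, convert):
    """Call convert(path) for each path; record failures instead of aborting."""
    converted = []
    failed = []
    for path in paths:
        try:
            convert(path)
            converted.append(path)
        except (MemoryError, RecursionError) as exc:
            # Keep the file name and error type so the file can be retried
            # or inspected later, then move on to the next file.
            failed.append((path, type(exc).__name__))
    return converted, failed


def fake_convert(path):
    """Stand-in converter that fails on one input, for demonstration only."""
    if "big" in path:
        raise MemoryError("document too large")


ok, bad = convert_all(["a.html", "big.html", "c.html"], fake_convert)
print(ok)   # ['a.html', 'c.html']
print(bad)  # [('big.html', 'MemoryError')]
```

One caveat when porting this to C#: an `OutOfMemoryException` can be caught this way, but a `StackOverflowException` raised by runaway recursion generally cannot be caught with try...catch and ends the process, so this only helps if the failure really is an out-of-memory condition.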

MRAB