
I noticed a memory problem when I leave my app running for a long time: I eventually get an out-of-memory exception. I tried to figure out what the problem was and was clueless until I let it run again and noticed

I get the leak on this line every time: html.LoadHtml(a_few_k_of_html);. I suspect HtmlAgilityPack is leaking. I tried wrapping it in a using block and calling Dispose, but neither is available on it. Not only does it happen on that line every time, but I also remember changing a few areas to use HtmlAgilityPack instead of parsing HTML with regex.

How do I deal with this memory issue short of modifying HtmlAgilityPack itself?

Community
  • Does your code retain a reference to the results of html.LoadHtml? Are you sure your code is no longer referencing it at all? – Eric J. Apr 17 '12 at 00:40
  • The html variable only has the scope of that one function and isn't used anywhere else. I'm positive I am not referencing it anywhere. This would be my first leak ever, and I think it may have to do with the HtmlAgilityPack backend. @EricJ. –  Apr 17 '12 at 01:04
  • Can you reproduce this in a simple test program? I'd be surprised to find that Html Agility Pack is leaking. I use it in a long-running program (my Web crawler that runs for days at a time, downloading thousands of pages per minute) and haven't noticed any leaks. – Jim Mischel Apr 17 '12 at 04:28
  • @JimMischel: Interesting. I noticed the problem occurred when I switched to my new computer (8 GB of RAM; it complains at 1.5 GB even though I have another 3 GB+ free) and installed VS11 (beta), but it appears to still use .NET 3.5. Maybe it's a runtime problem and not a leak? Still, I'm surprised it's getting to 1.5 GB even if I force a GC every 15 minutes. It used to always take <100 MB –  Apr 17 '12 at 19:27
  • A couple of questions. I'm groping in the dark here. First, that 1.5 gb is suspicious. Are you sure the program isn't running in 32-bit mode? Second, are you sure that you're not maintaining a reference to the strings that you send to `LoadHtml`? – Jim Mischel Apr 17 '12 at 20:35
  • @JimMischel: I'm positive. That string's scope is only the function. The HtmlAgilityPack HtmlDocument is either function scope or class scope. If it's in class scope then its lifetime is the thread's. Threads don't die: I start a bunch up, they sleep on their own, and I only wake them when I need to –  Apr 17 '12 at 21:20
  • I figured it out. It was a serious bug that I happened not to notice. –  May 18 '12 at 12:15
  • Man I hate that. Someone has the exact problem you do, and you get to the end of the problem thread and they are like "fixed it!" without any explanation. I probably have the same bug. – dylanT Aug 20 '17 at 22:44
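
The two checks suggested in the comments above (is the process really 64-bit, and how large is the managed heap) can be sketched roughly as follows. This is a minimal sketch, not something from the original poster: the class and method names are made up for illustration, and IntPtr.Size is used rather than Environment.Is64BitProcess on the assumption that the app targets .NET 3.5, as the comment mentions. A 32-bit process normally gets only about 2 GB of address space, so an out-of-memory exception around 1.5 GB on an 8 GB machine would be consistent with 32-bit mode.

```csharp
using System;

// Hypothetical diagnostics helper; the names are illustrative only.
static class MemoryDiagnostics
{
    // 4-byte pointers mean a 32-bit process, 8-byte pointers mean 64-bit.
    // IntPtr.Size is used because it is also available on .NET 3.5.
    public static int PointerBits()
    {
        return IntPtr.Size * 8;
    }

    // Approximate managed heap size in megabytes.
    // Passing true asks the runtime to collect before measuring.
    public static double ManagedHeapMb(bool collectFirst)
    {
        return GC.GetTotalMemory(collectFirst) / (1024.0 * 1024.0);
    }

    static void Main()
    {
        Console.WriteLine("Process is {0}-bit", PointerBits());
        Console.WriteLine("Managed heap: {0:F1} MB", ManagedHeapMb(false));
    }
}
```

Logging ManagedHeapMb(true) periodically from the long-running app would show whether the managed heap itself keeps growing after a collection, which is what a retained reference would look like.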

2 Answers


I had the same problem. After processing the document I set the document instance to null and then called GC.Collect(). That solved the problem.
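
A rough sketch of that approach, under some assumptions: the helper names ExtractText and ExtractAll and the collectEvery parameter are made up for illustration, and the document is kept as a local variable so it falls out of scope on return instead of being set to null explicitly. Only plain strings leave the method, so nothing should keep the parsed tree alive. The forced collection is the workaround described here, not a general recommendation.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper; names and structure are illustrative only.
static class HtmlTextExtractor
{
    // Parse one page and return only a plain string, so the HtmlDocument
    // (a local) becomes unreachable as soon as the method returns.
    public static string ExtractText(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.InnerText;
    }

    // Process many pages, forcing a collection every collectEvery documents.
    // Pass 0 to never force a collection.
    public static List<string> ExtractAll(IEnumerable<string> pages, int collectEvery)
    {
        List<string> texts = new List<string>();
        int processed = 0;
        foreach (string html in pages)
        {
            texts.Add(ExtractText(html));
            processed++;
            if (collectEvery > 0 && processed % collectEvery == 0)
            {
                GC.Collect();
                GC.WaitForPendingFinalizers();
            }
        }
        return texts;
    }
}
```

If memory still grows with this shape, something other than the document instance is probably holding a reference, for example an HtmlNode or node collection stored in a field.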

tomjcz
  • I don't think calling GC.Collect is such a good idea. http://programmers.stackexchange.com/questions/276585/when-is-it-a-good-idea-to-force-garbage-collection – Peter de Bruijn Jun 23 '16 at 08:49
  • +1. No, it's not a good idea (it's a rare case indeed), but when faced with a DLL that causes issues this fixed it for me. In my case I wanted to strip text from 300,000 documents and this was the only way through my pain – Rippo Oct 21 '16 at 07:29

Try using the HtmlAgilityPack.HtmlDocument Load() method instead of LoadHtml().

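using System.IO;
using System.Text;

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
// Encode explicitly as UTF-8; Encoding.Default depends on the machine's ANSI code page.
using (MemoryStream ms = new MemoryStream(Encoding.UTF8.GetBytes(a_few_k_of_html)))
{
    doc.Load(ms, Encoding.UTF8);
} // <-- Important: the stream is closed here, even if Load throws
// Do whatever you want with the HtmlDocument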
Tunaki