I am trying to do some screen scraping with HtmlAgilityPack, using SelectNodes and reading some values from each node it returns.

Here is the code:

private readonly HtmlDocument _document = new HtmlDocument();

public void ParseValues(string html)
{
    _document.LoadHtml(html);
    var tables = _document.DocumentNode.SelectNodes("//table");

    foreach (var table in tables)
    {
        _document.LoadHtml(table.OuterHtml);
        var value = _document.DocumentNode.SelectSingleNode("//tbody[1]/tr/td[0]");
    }
}

But I have noticed that when I try to select child nodes inside the foreach loop, the search actually starts from the document root, which is really annoying.

Questions:

  1. Is there a way to select the values from each table returned by SelectNodes without having to create a new HtmlDocument instance?

  2. Is there a way to dispose of an HtmlDocument? I noticed that there is a memory leak every time I call _document.LoadHtml(html);

Andrew Whitaker
Roman Ratskey

1 Answer

(for a more detailed explanation, see Html Agility Pack - Problem selecting subnode)


You don't have to create another HtmlDocument object or load more HTML into it. Just call SelectSingleNode on each table node with a relative XPath:

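foreach (var table in tables)
{
    // ".//" restricts the search to descendants of this table.
    // XPath positions are 1-based, so td[1] is the first cell; td[0] never matches.
    var value = table.SelectSingleNode(".//tbody[1]/tr/td[1]");
}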

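The key is to use .//tbody instead of //tbody. A leading // always searches from the document root, even when the method is called on a child node, whereas .// searches only the descendants of the current node.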
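
To make the difference concrete, here is a minimal sketch of the relative-XPath approach applied to a whole document. It assumes the HtmlAgilityPack NuGet package is referenced and that the markup contains explicit tbody elements; the class name RelativeXPathDemo, the method FirstCells, and the sample HTML are hypothetical and exist only for illustration.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class RelativeXPathDemo
{
    // Hypothetical helper: returns the text of the first cell of each table.
    public static string[] FirstCells(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html); // the document is loaded once, never inside the loop

        var tables = document.DocumentNode.SelectNodes("//table");
        if (tables == null)
        {
            // SelectNodes returns null rather than an empty collection when nothing matches.
            return Array.Empty<string>();
        }

        var result = new string[tables.Count];
        for (var i = 0; i < tables.Count; i++)
        {
            // ".//" keeps the search inside the current table;
            // td[1] is the first cell because XPath positions start at 1.
            var cell = tables[i].SelectSingleNode(".//tbody[1]/tr/td[1]");
            result[i] = cell?.InnerText.Trim();
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical sample markup with two tables.
        const string html =
            "<table><tbody><tr><td>A1</td><td>A2</td></tr></tbody></table>" +
            "<table><tbody><tr><td>B1</td><td>B2</td></tr></tbody></table>";

        foreach (var text in FirstCells(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

With the sample markup above, each table should yield its own first cell. Replacing ".//tbody[1]" with "//tbody[1]" would instead evaluate from the document root on every iteration and return the first table's cell each time.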

Oscar Mederos