I am trying to get information out of an HTML table by parsing the HTML with HtmlAgilityPack.

Here is what the HTML looks like:

...
...
...
<tbody>
                    <tr>
                        <td class="style_19" style="vertical-align: baseline;">
                            <div class="style_18">AA00857</div>
                        </td>
                        <td class="style_19" style="vertical-align: baseline;">
                            <div></div>
                            <div class="style_20">TPRCF</div>
                        </td>
                        <td class="style_19" style="vertical-align: baseline;">
                            <div class="style_21"></div>
                        </td>
                        <td class="style_19" style="vertical-align: baseline;">
                            <div class="style_21">16908/2</div>
                        </td>
                        <td class="style_19" style="vertical-align: baseline;">
                            <div class="style_18">&nbsp;ETG_C</div>
                        </td>
                    </tr>
                    <tr>
                        <td class="style_19" style="vertical-align: baseline;">
                            <div class="style_18">AA01231</div>
                        </td>
                        <td class="style_19" style="vertical-align: baseline;">
                            <div></div>
                            <div class="style_20">TPRCF</div>
                        </td>
                        <td class="style_19" style="vertical-align: baseline;">
                            <div class="style_21"></div>
                        </td>
                        <td class="style_19" style="vertical-align: baseline;">
                            <div class="style_21">16909/19</div>
                        </td>
                        <td class="style_19" style="vertical-align: baseline;">
                            <div class="style_18">&nbsp;ETG_C</div>
                        </td>
                    </tr>
                    <tr>
                        <td class="style_19" style="vertical-align: baseline;">
                            <div class="style_18">AA01233</div>
                        </td>
                        <td class="style_19" style="vertical-align: baseline;">
                            <div></div>
                            <div class="style_20">TPRCF</div>
                        </td>
                        <td class="style_19" style="vertical-align: baseline;">
                            <div class="style_21"></div>
                        </td>
                        <td class="style_19" style="vertical-align: baseline;">
                            <div class="style_21">16907/7</div>
                        </td>
                        <td class="style_19" style="vertical-align: baseline;">
                            <div class="style_18">&nbsp;ETG_C</div>
                        </td>
                    </tr>
...
...

I need to extract these values from each row of the HTML above, for example:

AA00857, TPRCF, 16908/2, ETG_C

So far, all I have is this:

HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmlDoc = hw.Load(@"http://www.some123123site.com/index");

if (htmlDoc.DocumentNode != null)
{
    HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//tbody");

    if (bodyNode != null)
    {
        // Do something with bodyNode
    }
}

Please help!

JOE SKEET

1 Answer

Try this:

HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmlDoc = hw.Load(@"http://www.some123123site.com/index");

// SelectNodes returns null, not an empty collection, when nothing matches.
HtmlNodeCollection textNodes = htmlDoc.DocumentNode.SelectNodes("//tr/td/div/text()");
if (textNodes != null)
{
    foreach (HtmlNode text in textNodes)
    {
        Console.WriteLine(text.InnerText);
    }
}
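
The snippet above prints every text node on its own line. If the values are needed grouped per table row, as in the question, one possible approach is to select the rows first and then the `div` elements inside each row. The following is only a sketch: it parses an inline copy of the first sample row instead of loading the real URL, and it assumes that empty `div` elements should be skipped and that the leading `&nbsp;` should be trimmed.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class RowExtractionSketch
{
    static void Main()
    {
        // Inline sample taken from the question, so the sketch needs no network access.
        const string html = @"<table><tbody>
<tr>
  <td class=""style_19""><div class=""style_18"">AA00857</div></td>
  <td class=""style_19""><div></div><div class=""style_20"">TPRCF</div></td>
  <td class=""style_19""><div class=""style_21""></div></td>
  <td class=""style_19""><div class=""style_21"">16908/2</div></td>
  <td class=""style_19""><div class=""style_18"">&nbsp;ETG_C</div></td>
</tr>
</tbody></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard before iterating.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tbody/tr");
        if (rows == null) return;

        foreach (HtmlNode row in rows)
        {
            HtmlNodeCollection cells = row.SelectNodes("./td/div");
            if (cells == null) continue;

            var values = cells
                // Decode entities such as &nbsp;, then strip the resulting no-break space.
                .Select(div => HtmlEntity.DeEntitize(div.InnerText).Replace('\u00A0', ' ').Trim())
                // Drop the empty placeholder divs.
                .Where(value => value.Length > 0);

            // One comma-separated line per table row.
            Console.WriteLine(string.Join(", ", values));
        }
    }
}
```

For the sample row this should produce the line `AA00857, TPRCF, 16908/2, ETG_C`. Against the live page, `doc.LoadHtml(html)` would be replaced by the `hw.Load(...)` call from the answer.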
Chandu
  • Error 1 'HtmlAgilityPack.HtmlDocument' does not contain a definition for 'DocumentElement' and no extension method 'DocumentElement' accepting a first argument of type 'HtmlAgilityPack.HtmlDocument' could be found – JOE SKEET Jan 07 '11 at 21:49
  • @cybernate Thank you. For some reason it does not like this line: HtmlAgilityPack.HtmlDocument htmlDoc = hw.Load(@"http://www.some123123site.com/index"); It tries to save the file when I run it. – JOE SKEET Jan 07 '11 at 22:07
  • I tested it with a URL on my localhost and I could see the result. Are you using the same code, or did you modify it? – Chandu Jan 07 '11 at 22:10
  • @cybernate: Here is my problem: the URL that I am trying to open is restricted. I first need to log in on a different page. What do I do? – JOE SKEET Jan 07 '11 at 22:29
  • @spark Would you know how to get around this? – JOE SKEET Jan 07 '11 at 22:32
  • Is it Windows authentication or custom auth? – Chandu Jan 07 '11 at 22:35
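
The thread does not say which kind of authentication the site uses, so the following is only a sketch of one possibility: a custom, form-based login that sets a session cookie. The login URL and the form field names (`username`, `password`) are hypothetical and would have to be taken from the real login page. The idea is to log in with an `HttpClient` that keeps cookies, download the restricted page with the same client, and hand the HTML string to HtmlAgilityPack via `LoadHtml` instead of using `HtmlWeb.Load`.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class LoginThenParseSketch
{
    static async Task Main()
    {
        // The cookie container keeps the session cookie between the two requests.
        var handler = new HttpClientHandler { CookieContainer = new CookieContainer() };

        using (var client = new HttpClient(handler))
        {
            // Hypothetical field names and credentials; inspect the real login form to find them.
            var form = new FormUrlEncodedContent(new Dictionary<string, string>
            {
                ["username"] = "user",
                ["password"] = "secret",
            });

            // Hypothetical login URL; the thread only says login happens on a different page.
            HttpResponseMessage login =
                await client.PostAsync("http://www.some123123site.com/login", form);
            login.EnsureSuccessStatusCode();

            // Fetch the restricted page with the authenticated session.
            string html = await client.GetStringAsync("http://www.some123123site.com/index");

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            HtmlNodeCollection textNodes = doc.DocumentNode.SelectNodes("//tr/td/div/text()");
            if (textNodes == null) return;

            foreach (HtmlNode text in textNodes)
            {
                Console.WriteLine(text.InnerText);
            }
        }
    }
}
```

If the site turns out to use Windows authentication instead, setting `UseDefaultCredentials = true` on the `HttpClientHandler` and skipping the login request may be enough, though that depends on how the server is configured.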