I want to parse this file (only the important parts are shown):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
...
</head>
<body onload="Xaprb.InputMask.setupElementMasks()">
<div align="center">
        <table> ... </table>
        <table width="900" height="500" border="0" cellpadding="0"
            cellspacing="0" class="content">
        <tr>
    <td width="45">&nbsp;</td>
    <td width="210" valign="top">
    <div class="np_table">
        <div class="np_bl">
            <div class="np_br">
                <div class="np_tl">
                    <div class="np_tr">
                    <span class="name_heading">Hello</span><br />
                    <span class="name_content">**NAME I NEED**</span><br />
                    <br /> <span class="name_heading">Number:</span><br />
                    <span class="name_content">**NUMBER I NEED**</span>
                    </div>
                </div>
            </div>
        </div>
    </div> <br>

    <div class="menu"> ... </div>

    <p>&nbsp;</p>
    </td>
    <td width="600" valign="top">
        <div class="content_table">
        <div class="ct_bl">
            <div class="ct_br">
                <div class="ct_tl">
                    <div class="ct_tr">
                       <span class="heading">...</span>
                       <p><b>**I need this number too: 250**</b> <br />
               <br />
               Here is the datum I want: **17-04-2014**. <br />
               Please do not...</p>
               <p><b>...</b></p>
    <br /><br>
                 </div>
            </div>
        </div>
      </div>
    </div>
    </td>
</body>
</html>

I want to extract four strings: the name, the two numbers, and the date. I have this code:

HttpClient client = new HttpClient();
var doc = new HtmlAgilityPack.HtmlDocument();
var html = await client.GetStringAsync("http://example.com");
doc.LoadHtml(html);

var name = ???
var numberone = ???
var numbertwo = ???
var date = ???

But I don't know how to get this information with the HTML Agility Pack. Can somebody help me or give me some hints?

1 Answer

With HtmlAgilityPack, you can use an XPath query to select a specific part of an HTML document, so it is worth reading an XPath tutorial to get started.

For example, to get NAME I NEED from the sample HTML in this question:

// Find the heading span containing "Hello", then take the span right after it.
var name =
    doc.DocumentNode
       .SelectSingleNode("//span[@class='name_heading' and .='Hello']/following-sibling::span[1]");
if (name != null) Console.WriteLine(name.InnerText);

Explanation of the XPath used in the sample above:

//span

  • scan the entire document for <span> elements...

[@class='name_heading' and .='Hello']

  • whose class attribute equals "name_heading" and whose text content equals "Hello",

/following-sibling::span[1]

  • then, from that <span>, get the nearest following sibling <span> element.
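
The same heading-then-sibling idea can be extended to the other three values. What follows is only a minimal sketch under some assumptions: it runs against a trimmed copy of the question's markup rather than the live page, it assumes the second number is the first run of digits inside the <b> element, and it assumes the date always has the form dd-MM-yyyy. The helper name TextOf is hypothetical and not part of HtmlAgilityPack.

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Trimmed copy of the question's markup (assumption: the real page keeps this shape).
const string html = @"
<div class=""np_tr"">
  <span class=""name_heading"">Hello</span><br />
  <span class=""name_content"">NAME I NEED</span><br />
  <br /> <span class=""name_heading"">Number:</span><br />
  <span class=""name_content"">NUMBER I NEED</span>
</div>
<div class=""ct_tr"">
  <span class=""heading"">...</span>
  <p><b>I need this number too: 250</b> <br />
  <br />
  Here is the datum I want: 17-04-2014. <br />
  Please do not...</p>
</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Hypothetical helper: trimmed text of the first node matching the XPath, or null.
string? TextOf(string xpath) =>
    doc.DocumentNode.SelectSingleNode(xpath)?.InnerText.Trim();

// Name and first number: the first span following each heading span.
var name = TextOf("//span[@class='name_heading' and .='Hello']/following-sibling::span[1]");
var numberone = TextOf("//span[@class='name_heading' and .='Number:']/following-sibling::span[1]");

// The second number and the date sit in free text, so select the enclosing
// elements with XPath and pull the values out with regular expressions.
var boldText = TextOf("//div[@class='ct_tr']//b") ?? "";
var contentText = TextOf("//div[@class='ct_tr']") ?? "";
var numbertwo = Regex.Match(boldText, @"\d+").Value;
var date = Regex.Match(contentText, @"\d{2}-\d{2}-\d{4}").Value;

Console.WriteLine($"{name} | {numberone} | {numbertwo} | {date}");
```

Regex.Match(...).Value is an empty string when nothing matches, so an unexpected page layout produces empty results instead of an exception; check for that before using the values.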
har07
  • Thanks! That works for the name, but the number is also in a span element with the class 'name_content', and the second number is inside the div element with the class 'ct_tr'. How can I read each number into its own variable (such as var numbertwo) and the date into var date? – user3493797 Apr 23 '14 at 11:02
  • Fixed my sample; by `r` I meant `name`. There is a lot of work to solve in one question, so I gave one sample. Try to work out criteria that could select the remaining parts, then translate those criteria into XPath queries. If you get stuck at any point, open a new question showing how far you have gotten and what you have researched. – har07 Apr 23 '14 at 11:27
  • OK, I understand it now, thanks for the help! I got the rest working! :) – user3493797 Apr 23 '14 at 11:29