Getting specific data from HTML

Question

I want to extract specific data from HTML. I'm using C# and HtmlAgilityPack.

Here's the HTML sample:

<p class="heading"><span>Greeting!</span>

<p class='verse'>Hi!<br>
Hello!</p><p class='verse'>Hello!<br>
Hi!</p>
<!-- I want to get the text of the two "verse" paragraphs above -->

<p class="writers"><strong>WE</strong><br/>

Here is my code in C#:

StringBuilder pureText = new StringBuilder();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(Lyrics);

var s = doc.DocumentNode.Descendants("p");

try
{
     foreach (HtmlNode childNode in s)
     {
          pureText.Append(childNode.InnerText);
     }
}
catch
{ }

UPDATE:

StringBuilder pureText = new StringBuilder();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(URL);

var s = doc.DocumentNode.SelectNodes("//p[@class='verse']"); // error

try
{
     foreach (HtmlNode childNode in s)
     {
          pureText.Append(childNode.InnerText);
     }
}
catch
{ }

ERROR:

'HtmlAgilityPack.HtmlNode' does not contain a definition for 'SelectNodes' and no extension method 'SelectNodes' accepting a first argument of type 'HtmlAgilityPack.HtmlNode' could be found (are you missing a using directive or an assembly reference?)

har07 · Accepted Answer · 2014-01-19T08:36:43.900

5

You can use an XPath query to select all <p> elements with class='verse', like this:

var s = doc.DocumentNode.SelectNodes("//p[@class='verse']");

Then run the same foreach loop you already have.

UPDATE I:

I don't know why the code above throws an error for you. I tested it on my PC and it works fine. Anyway, if you are willing to accept a workaround, the same query can be written without XPath, using LINQ, this way:

var s = doc.DocumentNode.Descendants("p").Where(o => o.Attributes["class"] != null && o.Attributes["class"].Value == "verse");

This solution is longer because we need to check whether a node has a class attribute before reading the attribute's value. Otherwise, we get a NullReferenceException if there is any <p> without a class attribute.
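For reference, here is a minimal sketch of the whole LINQ-based approach, combined with the foreach loop from the question. The class name VerseExtractor and the method ExtractVerses are hypothetical, introduced only for illustration. It uses HtmlNode.GetAttributeValue with an empty-string fallback in place of the explicit null check, and assumes the input string contains the HTML markup itself, not a URL.

```csharp
using System.Linq;
using System.Text;
using HtmlAgilityPack;

public static class VerseExtractor // hypothetical name, for illustration only
{
    // Returns the text of every <p class="verse"> element, one paragraph per line group.
    public static string ExtractVerses(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // expects HTML markup, not a URL

        // GetAttributeValue returns the fallback ("") when the attribute is missing,
        // so a <p> without a class attribute is simply filtered out.
        var verses = doc.DocumentNode
            .Descendants("p")
            .Where(p => p.GetAttributeValue("class", "") == "verse");

        var pureText = new StringBuilder();
        foreach (HtmlNode verse in verses)
        {
            // InnerText drops tags such as <br>; DeEntitize decodes entities like &amp;.
            pureText.AppendLine(HtmlEntity.DeEntitize(verse.InnerText).Trim());
        }
        return pureText.ToString();
    }
}
```

Because this relies only on Descendants and LINQ, it does not need SelectNodes or any other XPath support.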

edited Jan 19 '14 at 08:36

answered Jan 19 '14 at 07:55



It gives this error: 'HtmlAgilityPack.HtmlNode' does not contain a definition for 'SelectNodes' and no extension method 'SelectNodes' accepting a first argument of type 'HtmlAgilityPack.HtmlNode' could be found (are you missing a using directive or an assembly reference?) – user3190447 Jan 19 '14 at 08:04
The argument of `SelectNodes` should be a string, as you see in my answer, not an `HtmlNode`. How did you apply this solution? If you can't work out how to fix it, post the code that triggers the error. – har07 Jan 19 '14 at 08:06
StringBuilder pureText = new StringBuilder(); HtmlDocument doc = new HtmlDocument(); doc.LoadHtml(URL); var s = doc.DocumentNode.SelectNodes("//p[@class='verse']"); // error try { foreach (HtmlNode childNode in s) { pureText.Append(childNode.InnerText); } } catch { } – user3190447 Jan 19 '14 at 08:21

Are you working on a WinRT application? If so, [this post](http://stackoverflow.com/questions/15941529/htmlagilitypack-windows-8-metro-apps) may be related to the error you got. WinRT doesn't support XPath. – har07 Jan 19 '14 at 08:51
Yes, I'm working on WinRT, so SelectNodes doesn't work. How do I get the data from the HTML? – user3190447 Jan 19 '14 at 08:58
Use LINQ to XML style syntax, as shown in the **UPDATE I** section of this answer. – har07 Jan 19 '14 at 08:59
