
I want to collect some data from the front page of a website. I can already loop over the page line by line, and only one specific line interests me. I want to identify that line and extract the number from it, in this case 324. How can I do this?

<h2><a href="/mmp/it/su/">Weather</a></h2> <span class="jix_channels_count">(324)</span><br><p class="jix_channels_desc">Prog&oslash;r, su, si&oslash;r, tester</p>
Ergwun
Kasper Hansen

2 Answers


After downloading the contents, use an HTML Parser such as HTML Agility Pack to identify the span element belonging to the jix_channels_count class.
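
As a rough illustration of the HTML Agility Pack route, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced and that the page markup is already in a string; the class name ChannelCountHap and the method Extract are hypothetical names chosen for this example.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

static class ChannelCountHap
{
    // Returns the number inside the jix_channels_count span, or null if it is not found.
    public static int? Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath: first span whose class attribute is exactly "jix_channels_count"
        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span[@class='jix_channels_count']");
        if (span == null)
            return null;

        // InnerText is "(324)" for the sample line; strip the parentheses before parsing
        string digits = span.InnerText.Trim('(', ')', ' ');
        return int.TryParse(digits, out int count) ? count : (int?)null;
    }
}
```

The XPath test matches the class attribute exactly, so it would need adjusting if the span ever carried more than one class.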

Another option is SgmlReader.

You tagged your question with regex; I strongly advise against going in that direction.

The suggested approach (with SgmlReader) goes more or less like so:

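// Needs: System.IO, System.Net, System.Text, System.Xml, System.Xml.Linq, and Sgml (SgmlReader)
var url = "http://www.that-website.com/foo/"; // WebRequest.Create requires an absolute URI, including the scheme
var myRequest = (HttpWebRequest)WebRequest.Create(url);
myRequest.Method = "GET";
WebResponse myResponse = myRequest.GetResponse();
var responseStream = myResponse.GetResponseStream();
var sr = new StreamReader(responseStream, Encoding.Default);
// SgmlReader turns (possibly malformed) HTML into well-formed XML
var reader = new SgmlReader
             {
                 DocType = "HTML",
                 WhitespaceHandling = WhitespaceHandling.None,
                 CaseFolding = CaseFolding.ToLower,
                 InputStream = sr
             };
var xmlDoc = new XmlDocument();
xmlDoc.Load(reader);
// Convert to XElement so the document can be queried with LINQ to XML
var nodeReader = new XmlNodeReader(xmlDoc);
XElement xml = XElement.Load(nodeReader);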

Now you can just use LINQ to XML to (recursively or otherwise) find the span element with an attribute class whose value equals jix_channels_count and read the value of that element.
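
The LINQ to XML lookup described above could look roughly like the following sketch. It assumes the root element is an XElement like xml from the previous snippet; the class name ChannelCount and the method Extract are hypothetical. Elements are compared by local name, as a precaution in case the parsed document carries an XML namespace.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class ChannelCount
{
    // Returns the number inside the jix_channels_count span, or null if it is not found.
    public static int? Extract(XElement root)
    {
        // Search all descendants for a span whose class attribute equals jix_channels_count
        XElement span = root.Descendants().FirstOrDefault(e =>
            e.Name.LocalName == "span" &&
            (string)e.Attribute("class") == "jix_channels_count");
        if (span == null)
            return null;

        // The element value is "(324)" for the sample line; strip the parentheses
        string digits = span.Value.Trim('(', ')', ' ');
        return int.TryParse(digits, out int count) ? count : (int?)null;
    }
}
```

With the document loaded as shown earlier, ChannelCount.Extract(xml) would return the count, or null when the span is missing or its content is not a number.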

Konrad Morawski

Parsing an HTML page with regexes is the wrong approach in general. Still, if you know the exact structure of a single line, you can apply a regex to it as plain text, without treating it as HTML.

Assuming that the number is always inside parentheses within the span that has the jix_channels_count class:

Match match = Regex.Match(htmlLine, @"(\<span[^>]*class=""jix_channels_count[^>]*\>\()([^)]+)(\))", RegexOptions.IgnoreCase);
if (match.Success)
{
    string number = match.Groups[2].Value;
}
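
Since the question mentions looping over the page line by line, here is a small sketch of that loop. It uses a slightly stricter variant of the pattern above (an assumption of this example, not the original pattern) that captures only digits, so the result can be parsed straight into an int. The names ChannelCountRegex and FindCount are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class ChannelCountRegex
{
    // Matches the span with class jix_channels_count and captures the digits inside the parentheses
    private static readonly Regex CountPattern = new Regex(
        @"<span[^>]*class=""jix_channels_count""[^>]*>\((\d+)\)",
        RegexOptions.IgnoreCase);

    // Returns the count from the first matching line, or null if no line matches.
    public static int? FindCount(string[] lines)
    {
        foreach (string line in lines)
        {
            Match match = CountPattern.Match(line);
            if (match.Success && int.TryParse(match.Groups[1].Value, out int count))
                return count;
        }
        return null;
    }
}
```

This only holds up while the site keeps emitting the span in exactly this shape; any change to the markup would make the method return null.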
Marek Musielak