C# Scrape data from wiki page (screen-scraping)

Question

I want to scrape a Wiki page. Specifically, this one.

My app will allow users to enter the registration number of the vehicle (for example, SBS8988Z) and it will display the related information (which is on the page itself).

For example, if the user enters SBS8988Z into a text field in my application, it should look for the line on that wiki page

SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)

and return SBS8988Z (SLBP 192/194*) - F&N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen).

My code so far is (copied and edited from various websites)...

WebClient getdeployment = new WebClient();
string url = "http://sgwiki.com/wiki/Scania_K230UB_(Batch_1_Euro_V)";

getdeployment.Headers["User-Agent"] = "NextBusApp/GetBusData UserAgent";
string sgwikiresult = getdeployment.DownloadString(url); // <<< EXCEPTION
MessageBox.Show(sgwikiresult); //for debugging only!

HtmlAgilityPack.HtmlDocument sgwikihtml = new HtmlAgilityPack.HtmlDocument();
sgwikihtml.Load(new StreamReader(sgwikiresult));
HtmlNode root = sgwikihtml.DocumentNode;

List<string> anchorTags = new List<string>();   

foreach(HtmlNode deployment in root.SelectNodes("SBS8988Z"))
{
    string att = deployment.OuterHtml;
    anchorTags.Add(att);
}

However, I am getting an "ArgumentException was unhandled: Illegal characters in path" error.

What is wrong with the code? Is there an easier way to do this? I'm using HtmlAgilityPack but if there is a better solution, I'd be glad to comply.

See this link also: http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack — Prasanth, Sep 19 '11 at 11:56
yep I've seen them. One of those sites was where I got this snippet of code from! Of course, I edited it, but it's not working :( — ryanswj, Sep 19 '11 at 12:03

Jeff Mercado · Accepted Answer · 2011-09-24T05:51:01.177

What's wrong with the code? To be blunt, everything. :P

The page is not formatted in the way you are reading it. You can't hope to get the desired contents that way.
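
As for the exception itself: `new StreamReader(string)` treats its argument as a file path, so handing it the downloaded HTML is the likely source of the "Illegal characters in path" error (the marker in the question sits on `DownloadString`, but that call only returns the page text). Likewise, `SelectNodes` expects an XPath expression, not the text you are searching for. Here is a minimal sketch of parsing a string that is already in memory, assuming the HtmlAgilityPack package is referenced; the `HtmlLoading` class and `ParseHtml` method are names made up for illustration:

```csharp
using HtmlAgilityPack;

static class HtmlLoading
{
    // Hypothetical helper (not part of HtmlAgilityPack): parse HTML held in a string.
    public static HtmlDocument ParseHtml(string html)
    {
        var doc = new HtmlDocument();

        // LoadHtml parses the string itself. Load(string) and
        // new StreamReader(string) would both treat it as a file path.
        doc.LoadHtml(html);
        return doc;
    }
}
```

With that, the result of `DownloadString(url)` can be passed straight to `ParseHtml` without a `StreamReader` in between.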

The contents of the page (the part we're interested in) look something like this:

<h2>
<span id="Deployments" class="mw-headline">Deployments</span>
</h2>
<p>
    <!-- ... -->
    <b>SBS8987B</b>
    (SLBP 192/194*)
    <br>
    <b>SBS8988Z</b>
    (SLBP 192/194*) - F&amp;N NutriSoy Fresh Milk: Singapore's No. 1 Soya Milk! (2nd Gen)
    <br>
    <b>SBS8989X</b>
    (SLBP SP)
    <br>
    <!-- ... -->
</p>

Basically we need to find the b elements that contain the registration number we are looking for. Once we find that element, get the text and put it together to form the result. Here it is in code:

// Requires the HtmlAgilityPack package and `using System.Linq;` at the top of the file
static string GetVehicleInfo(string reg)
{
    var url = "http://sgwiki.com/wiki/Scania_K230UB_%28Batch_1_Euro_V%29";

    // HtmlWeb is a helper class to get pages from the web
    var web = new HtmlAgilityPack.HtmlWeb();

    // Create an HtmlDocument from the contents found at given url
    var doc = web.Load(url);

    // Create an XPath to find the `b` elements which contain the registration numbers
    var xpath = "//h2[span/@id='Deployments']" // find the `h2` element that has a span with the id, 'Deployments' (the header)
              + "/following-sibling::p[1]"     // move to the first `p` element (where the actual content is in) after the header
              + "/b";                          // select the `b` elements

    // Get the elements from the specified XPath (SelectNodes returns null when nothing matches)
    var deployments = doc.DocumentNode.SelectNodes(xpath);
    if (deployments == null)
        return null;

    // Create a LINQ query to find the requested registration number and generate a result
    var query =
        from b in deployments                 // from the list of registration numbers
        where b.InnerText == reg              // find the registration we're looking for
        select reg + b.NextSibling.InnerText; // and create the result combining the registration number with the description (the text following the `b` element)

    // The query should yield exactly one result, or none (null); SingleOrDefault throws if there are several
    var content = query.SingleOrDefault();

    // Decode the content (to convert stuff like "&amp;" to "&")
    var decoded = System.Net.WebUtility.HtmlDecode(content);

    return decoded;
}
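
To try the lookup without hitting the site, the same steps can be split from the download. The following is only a sketch under a few assumptions: `DeploymentLookup` and `FindInHtml` are hypothetical names, the XPath is the one built above, and the HTML passed in is expected to have the same `h2`/`p`/`b` layout shown earlier.

```csharp
using System.Linq;
using System.Net;
using HtmlAgilityPack;

static class DeploymentLookup
{
    // Hypothetical helper: same XPath as above, applied to an HTML string.
    public static string FindInHtml(string html, string reg)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The `b` elements in the first paragraph after the Deployments header
        var bold = doc.DocumentNode.SelectNodes(
            "//h2[span/@id='Deployments']/following-sibling::p[1]/b");
        if (bold == null)
            return null; // SelectNodes returns null when nothing matches

        // Pick the `b` element whose text is the registration number
        var match = bold.FirstOrDefault(b => b.InnerText.Trim() == reg);
        if (match == null)
            return null;

        // The description is the text node right after the `b` element
        var description = match.NextSibling?.InnerText ?? "";
        return WebUtility.HtmlDecode(reg + description).Trim();
    }
}
```

Calling `DeploymentLookup.FindInHtml(html, "SBS8988Z")` on a string holding the fragment above should give back the registration number followed by its description, and `null` when the number is not listed.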

Haha! Is it right to assume that only information between tags can be found that way? I will implement this, and try it out. Thanks so much, Jeff! — ryanswj, Sep 23 '11 at 11:37
n00b question: then what happens after this chunk of code? do I just paste it under my private void getDeployment_Click (object sender, EventArgs e) section? I'm also getting an error: Since getDeployment_click returns void, a return keyword must not be followed by an object expression. THANKS SO MUCH! :) — ryanswj, Sep 23 '11 at 11:45
It's just a method. Paste it somewhere in your class and call it from wherever you need it. — Jeff Mercado, Sep 23 '11 at 18:22
thanks Jeff! That completely worked for me. If possible, could you explain what the code does after the var doc = web.Load(url) part? Thanks! — ryanswj, Sep 24 '11 at 05:15
Jeff - it didn't work for this page http://sgwiki.com/wiki/Volvo_B10M_Mark_IV_(Walter_Alexander_Strider) when I try to find SBS1903P and http://sgwiki.com/wiki/Volvo_B10M_Mark_IV_(DM3500) when I try to find SBS2838M. From reading the code, I have inferred that this problem is caused by the extra heading below the Deployment heading. Is there any way to work around this? — ryanswj, Sep 24 '11 at 07:23
In those cases, testing for equality no longer applies since most items on those pages have subscripted numbers at the end. You would either have to include those numbers when you supply the `reg` variable or change the comparison to use `StartsWith()`. You'll also have to figure out how to include the multiple subsections found in the other page. I really can't help you any further on this right now. You really need to learn the basics and how to apply them here. Once you've made some progress, I'd be glad to help you further but until then... — Jeff Mercado, Sep 24 '11 at 07:40
All right then. Jeff, I'm really grateful for all this help you've given me. Thanks a lot! — ryanswj, Sep 24 '11 at 07:51
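
The `StartsWith()` suggestion from the comments could look roughly like the sketch below. It works on plain strings so it can be tried on its own; `RegistrationMatch` and `FindByPrefix` are hypothetical names, and how the extra subsections on those pages should be collected is left open, as it is in the comments.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class RegistrationMatch
{
    // Hypothetical helper: return the first label that begins with the
    // registration number, so trailing subscript digits do not break the match.
    public static string FindByPrefix(IEnumerable<string> labels, string reg)
    {
        return labels.FirstOrDefault(
            label => label.Trim().StartsWith(reg, StringComparison.Ordinal));
    }
}
```

In the `GetVehicleInfo` query this would amount to replacing `b.InnerText == reg` with `b.InnerText.StartsWith(reg)`. Note that a prefix test can match more than one entry if one registration number is a prefix of another, which is why the sketch takes the first match rather than insisting on a single one.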
