Get Zip Codes in Wikipedia with HTML Document

Question

I'm reading this Wikipedia page, http://en.wikipedia.org/wiki/List_of_postal_codes_in_Spain, which is a list of zip codes in Spain.

My goal is to get all zip codes from the "Full codes" section of the page. For example, I need to get this information (zip code - locality):

    03000 to 03099 - Alicante
    03189 - Villamartin
    03201 to 03299 - Elche
    03400 - Villena

In my code, I'm having difficulty getting only the li and a tags that come after the title "Full codes".

    HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("http://en.wikipedia.org/wiki/List_of_postal_codes_in_Spain");
    request.UserAgent = "Test wiki";
    WebResponse response = request.GetResponse();
    Stream stream = response.GetResponseStream();
    StreamReader reader = new StreamReader(stream);
    string htmlText = reader.ReadToEnd();

    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(htmlText);

    if (doc.DocumentNode != null)
    {
        HtmlNodeCollection divs = doc.DocumentNode.SelectNodes("//li");
        foreach (HtmlNode listElement in divs)
        {
            if (listElement.SelectNodes("//a[@href]").Count > 0)
            { // I do not get what I wish
                foreach (HtmlNode listElement2 in listElement.SelectNodes("//a[@href]"))
                {
                    string s = listElement2.Name;
                    string ss = listElement2.InnerText;
                }
            }
        }
    }

How much do you value your time? There are a lot of websites out there with [out-of-the-box](http://www.geopostcodes.com/Spain) [datasets](https://www.aggdata.com/free/spain-postal-codes) of this kind of thing. — Brad Christie, May 04 '15 at 21:16
You should put WebResponse, Stream and StreamReader in a using statement. Maybe you can also use Regex? — thijmen321, May 04 '15 at 21:17

score 1 · Accepted Answer · edited May 23 '17 at 11:43

I would personally avoid using regex to parse HTML. To get you started, an XPath expression that selects the <li> tags following the title "Full codes" looks roughly like this:

    //h2[span='Full codes']/following::li

To be more precise, you can select the <ul> siblings of that heading instead, then take their <li> children:

    //h2[span='Full codes']/following-sibling::ul/li
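To see how the two axes differ, here is a minimal sketch that runs both expressions against a small hand-written fragment using the XPath 1.0 support in System.Xml. The fragment, the class name XPathAxisDemo and the Count helper are made up for illustration; the fragment is only an assumption about the general shape of the page, not the real Wikipedia markup.

```csharp
using System;
using System.Xml;

public static class XPathAxisDemo
{
    // Hand-written stand-in for the page layout; NOT the real Wikipedia markup.
    public const string Sample = @"<div>
  <h2><span>Full codes</span></h2>
  <ul>
    <li>03000 to 03099 - Alicante</li>
    <li>03189 - Villamartin</li>
  </ul>
  <h2><span>See also</span></h2>
  <ul><li>Another list</li></ul>
  <div><ul><li>Nested footer link</li></ul></div>
</div>";

    // Counts the nodes an XPath 1.0 expression selects in the sample.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        // Every <li> after the heading in document order, nested ones included.
        Console.WriteLine(Count("//h2[span='Full codes']/following::li"));
        // Only <li> children of <ul> elements that are siblings of the heading.
        Console.WriteLine(Count("//h2[span='Full codes']/following-sibling::ul/li"));
        // Only the first <ul> sibling after the heading.
        Console.WriteLine(Count("//h2[span='Full codes']/following-sibling::ul[1]/li"));
    }
}
```

On this fragment the three counts are 4, 3 and 2. Note that following-sibling::ul still picks up the list under the later "See also" heading, because that <ul> is also a sibling of the first heading. If the real page has more lists after the section you want, something like the ul[1] variant or an extra filter may be needed.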

Side note: HtmlAgilityPack's HtmlWeb can also load that Wikipedia page in a much shorter way:

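    // Requires the HtmlAgilityPack package, plus: using System; using HtmlAgilityPack;
    var doc = new HtmlWeb().Load("http://en.wikipedia.org/wiki/List_of_postal_codes_in_Spain");
    if (doc.DocumentNode != null)
    {
        var data = doc.DocumentNode.SelectNodes("//h2[span='Full codes']/following-sibling::ul/li");
        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (data != null)
        {
            foreach (HtmlNode htmlNode in data)
            {
                // Each item's text looks like "03000 to 03099 - Alicante".
                Console.WriteLine(htmlNode.InnerText.Trim());
            }
        }
    }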
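Since the goal was "zip code - locality" pairs, each printed line still needs to be split in two. A minimal sketch, assuming every item uses a plain " - " separator as in the example from the question (the real page may use a different dash, so check the actual text first). PostalLine and TrySplit are hypothetical names, not part of HtmlAgilityPack:

```csharp
using System;

public static class PostalLine
{
    // Splits "03000 to 03099 - Alicante" into ("03000 to 03099", "Alicante").
    // Returns false when the " - " separator is missing or either side is empty.
    public static bool TrySplit(string text, out string codes, out string locality)
    {
        codes = null;
        locality = null;
        if (text == null) return false;

        // Assumption: the first " - " separates the code(s) from the locality.
        int i = text.IndexOf(" - ", StringComparison.Ordinal);
        if (i < 0) return false;

        codes = text.Substring(0, i).Trim();
        locality = text.Substring(i + 3).Trim();
        return codes.Length > 0 && locality.Length > 0;
    }

    public static void Main()
    {
        foreach (var line in new[] { "03000 to 03099 - Alicante", "03189 - Villamartin" })
        {
            if (TrySplit(line, out string codes, out string locality))
                Console.WriteLine(codes + " => " + locality);
        }
    }
}
```

This is plain string handling on the already extracted InnerText, so it does not go against the advice about not parsing HTML with regex.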
