I'm reading this Wikipedia page, http://en.wikipedia.org/wiki/List_of_postal_codes_in_Spain, which lists the postal (zip) codes of Spain.

My goal is to get all zip codes from the "Full codes" section of the page. For example, I need to get this information (zip code - locality):

    03000 to 03099 - Alicante
    03189 - Villamartin
    03201 to 03299 - Elche
    03400 - Villena

In my code, I'm having difficulty getting only the li and a tags that come after the "Full codes" heading.

    HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("http://en.wikipedia.org/wiki/List_of_postal_codes_in_Spain");
    request.UserAgent = "Test wiki";
    WebResponse response = request.GetResponse();
    Stream stream = response.GetResponseStream();
    StreamReader reader = new StreamReader(stream);
    string htmlText = reader.ReadToEnd();

    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(htmlText);

    if (doc.DocumentNode != null)
    {
        HtmlNodeCollection divs = doc.DocumentNode.SelectNodes("//li");
        foreach (HtmlNode listElement in divs)
        {
            if (listElement.SelectNodes("//a[@href]").Count > 0)
            { // I do not get what I wish
                foreach (HtmlNode listElement2 in listElement.SelectNodes("//a[@href]"))
                {
                    string s = listElement2.Name;
                    string ss = listElement2.InnerText;
                }
            }
        }
    }
John Saunders
user2852514

1 Answer

I would personally avoid using regex to parse HTML. To get you started, an XPath expression that selects the <li> tags following the "Full codes" heading looks roughly like this:

    //h2[span='Full codes']/following::li

To be more precise, you can select the sibling <ul> elements instead, and then their <li> children:

    //h2[span='Full codes']/following-sibling::ul/li
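
The difference between the two expressions is easier to see on a small fragment. The sketch below runs them against a made-up HTML snippet (not the actual Wikipedia markup) and assumes the heading has the `<h2><span>Full codes</span></h2>` shape that both expressions rely on. The third expression is my own addition, not something the page requires: it keeps only the `<ul>` elements whose nearest preceding `<h2>` is the "Full codes" heading, in case lists from later sections turn out to be matched as well.

```csharp
using System;
using HtmlAgilityPack;

class XPathAxesDemo
{
    static void Main()
    {
        // Hypothetical fragment, only loosely modelled on the page layout.
        const string html = @"
<div>
  <h2><span>Full codes</span></h2>
  <ul>
    <li>03400 - Villena</li>
    <li>03189 - Villamartin</li>
  </ul>
  <h2><span>See also</span></h2>
  <ul>
    <li>Unrelated link</li>
  </ul>
  <div><ol><li>Nested footnote</li></ol></div>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Every <li> that comes after the heading, at any depth.
        Print(doc, "//h2[span='Full codes']/following::li");
        // Only <li> children of <ul> siblings of the heading.
        Print(doc, "//h2[span='Full codes']/following-sibling::ul/li");
        // Same, restricted to lists that belong to the "Full codes" section.
        Print(doc, "//h2[span='Full codes']/following-sibling::ul[preceding-sibling::h2[1][span='Full codes']]/li");
    }

    static void Print(HtmlDocument doc, string xpath)
    {
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        Console.WriteLine("{0} -> {1} match(es)", xpath, nodes == null ? 0 : nodes.Count);
    }
}
```

On this fragment I would expect the first expression to match four items, the second three, and the third two. Whether the stricter third form is needed on the real page depends on its actual markup.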

As a side note, HtmlAgilityPack's HtmlWeb can load that Wikipedia page with much less code:

    var doc = new HtmlWeb().Load("http://en.wikipedia.org/wiki/List_of_postal_codes_in_Spain");
    if (doc.DocumentNode != null)
    {
        var data = doc.DocumentNode.SelectNodes("//h2[span='Full codes']/following-sibling::ul/li");
        // SelectNodes returns null (not an empty collection) when nothing matches
        if (data != null)
        {
            foreach (HtmlNode htmlNode in data)
            {
                Console.WriteLine(htmlNode.InnerText.Trim());
            }
        }
    }
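
If you also need the zip code and the locality as separate values, or the links inside each item as in the question's loop, something along these lines may help. It is only a sketch on a made-up fragment: the `" - "` separator is an assumption taken from the sample lines in the question, and the real page may use a different dash or layout. Note the leading dot in `.//a[@href]`. An expression that starts with `//` is evaluated from the document root even when it is called on a single node, which is why the inner `SelectNodes("//a[@href]")` in the question returns every link on the page.

```csharp
using System;
using HtmlAgilityPack;

class ListItemDemo
{
    static void Main()
    {
        // Hypothetical fragment; the real page's markup may differ.
        const string html = @"
<h2><span>Full codes</span></h2>
<ul>
  <li>03000 to 03099 - <a href=""/wiki/Alicante"">Alicante</a></li>
  <li>03189 - Villamartin</li>
</ul>
<p><a href=""/wiki/Other"">Link outside the list</a></p>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var items = doc.DocumentNode.SelectNodes("//h2[span='Full codes']/following-sibling::ul/li");
        if (items == null) return; // SelectNodes returns null when nothing matches

        foreach (HtmlNode li in items)
        {
            // The leading dot keeps the search inside this <li>;
            // "//a[@href]" would search the whole document again.
            var links = li.SelectNodes(".//a[@href]");
            int linkCount = links == null ? 0 : links.Count;

            // Decode HTML entities, then split "code - locality" at the first separator.
            string text = HtmlEntity.DeEntitize(li.InnerText).Trim();
            string[] parts = text.Split(new[] { " - " }, 2, StringSplitOptions.None);
            string code = parts[0].Trim();
            string locality = parts.Length > 1 ? parts[1].Trim() : "";

            Console.WriteLine("{0} | {1} | links: {2}", code, locality, linkCount);
        }
    }
}
```

Items that do not contain the assumed separator end up with an empty locality, so it is worth checking a few real lines before relying on the split.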
Community
har07