I'm trying to scrape a website. I've accomplished this on other projects, but I can't seem to get this one right. It could be that I've been up for over two days working and I'm missing something. Could someone please look over my code? Here it is:

using System;
using System.Collections.Generic;
using HtmlAgilityPack;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Linq;
using System.Xml.Linq;
using System.IO;

public partial class _Default : System.Web.UI.Page
{
    List<string> names = new List<string>();
    List<string> address = new List<string>();
    List<string> number = new List<string>();
    protected void Page_Load(object sender, EventArgs e)
    {
        string url = "http://www.scoot.co.uk/find/" + "cafe" + " " + "-in-uk?page=" + "4";
        var Webget = new HtmlWeb();
        var doc = Webget.Load(url);
        List<List<string>> mainList = new List<List<string>>();

        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//h2//a"))
        {
            names.Add(Regex.Replace(node.ChildNodes[0].InnerHtml, @"\s{2,}", " "));
        }
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//p[@class='result-address']"))
        {
            address.Add(Regex.Replace(node.ChildNodes[0].InnerHtml, @"\s{2,}", " "));
        }
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//p[@class='result-number']"))
        {
            number.Add(Regex.Replace(node.ChildNodes[0].InnerHtml, @"\s{2,}", " "));
        }

        XDocument doccy = new XDocument(
            new XDeclaration("1.0", "utf-8", "yes"),
            new XComment("Business For Sale"),
            new XElement("Data",
                from data in mainList
                select new XElement("data", new XAttribute("data", "data"),
                    new XElement("Name : ", names[0]),
                    new XElement("Add : ", address[0]),
                    new XElement("Number : ", number[0])
                )
            )
        );

        var xml = doccy.ToString();

        Response.ContentType = "text/xml"; //Must be 'text/xml'
        Response.ContentEncoding = System.Text.Encoding.UTF8; //We'd like UTF-8
        doccy.Save(Response.Output); //Save to the text-writer

    }

}

The website lists business name, phone number and address, all identified by a class name (result-address, result-number, etc.). I am trying to get XML output so I can get the business name, address and phone number from each listing on page 4 for a presentation tomorrow, but I can't get it to work at all!

The results are right in all 3 of the foreach loops, but they won't output in the XML; I get an out-of-range error.
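I think the LINQ query over mainList is part of the problem, since mainList never gets filled. The shape I'm imagining is the three lists combined by index into one element per listing, something like this untested sketch (assuming all three lists come back the same length):

    XDocument doccy = new XDocument(
        new XDeclaration("1.0", "utf-8", "yes"),
        new XComment("Business For Sale"),
        new XElement("Data",
            Enumerable.Range(0, names.Count).Select(i =>
                new XElement("Listing",
                    new XElement("Name", names[i]),
                    new XElement("Address", address[i]),
                    new XElement("Number", number[i])))));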

  • It's the XML part; I'm getting the lists in all of the loops, but I guess it's putting them together in the XML objects. I can get them all there as separate objects, but I need it to be name, address, pnum, name, address, pnum, etc. rather than nameaddresspnum – Gaz Smith Aug 27 '16 at 00:49
  • Without some sample HTML it's hard to say, but usually each listing will have a common parent element. I'd recommend selecting the common parent element and doing the foreach over that one, then selecting inner nodes and getting the individual values one by one. This will allow you to define a custom object and populate it, which would be easy to serialize to XML. Not really a direct answer, I'm sorry, but maybe a different approach. – Joe_DM Aug 27 '16 at 00:54
  • The URL for the HTML page is in the code; that's why I added the whole lot – Gaz Smith Aug 27 '16 at 00:57
  • Please [edit] your post and remove code that works, keeping only the pieces that demonstrate what is broken (see [MCVE] for guidance). It sounds like the whole HTML parsing part is completely unrelated to the problem you have (along with the "web-scraping" tag). – Alexei Levenkov Aug 27 '16 at 01:01
  • Well, it's not, because I think the process of the collection may be wrong; I think the whole code is relevant or I wouldn't have posted it all. – Gaz Smith Aug 27 '16 at 01:03
  • It works in that it gathers the data I need, but for it to be displayed correctly in the XML I'm sure the for loops will somehow need to be combined. – Gaz Smith Aug 27 '16 at 01:04
  • I'll have an answer for you in a second, just putting on the finishing touches, so don't stress :) – Joe_DM Aug 27 '16 at 01:36

1 Answer


My first piece of advice would be to keep your CodeBehind as light as possible. If you bloat it up with business logic then the solution will become difficult to maintain. That's off topic, but I recommend looking up SOLID principles.

First, I've created a custom object to work with instead of using Lists of strings which have no way of knowing which address item links up with which name:

public class Listing
{
    public string Name { get; set; }
    public string Address { get; set; }
    public string Number { get; set; }
}
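
Side note: XmlSerializer handles this type out of the box because it has a public parameterless constructor and public read/write properties. A quick standalone sketch (the sample values are made up) of what one serialized Listing looks like with the default settings:

    using System;
    using System.IO;
    using System.Xml.Serialization;

    public static class ListingXmlDemo
    {
        public static void Main()
        {
            // Made-up values, just to show the default element shape.
            var listing = new Listing { Name = "Example Cafe", Address = "1 High St", Number = "01234 567890" };
            var serializer = new XmlSerializer(typeof(Listing));
            using (var writer = new StringWriter())
            {
                serializer.Serialize(writer, listing);
                Console.WriteLine(writer.ToString());
                // Prints roughly:
                // <?xml version="1.0" encoding="utf-16"?>
                // <Listing>
                //   <Name>Example Cafe</Name>
                //   <Address>1 High St</Address>
                //   <Number>01234 567890</Number>
                // </Listing>
            }
        }
    }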

Here is the heart of it: a class that does all the scraping and serializing. (I've broken SOLID principles, but sometimes you just want it to work right.)

using System.Collections.Generic;
using HtmlAgilityPack;
using System.IO;
using System.Xml;
using System.Xml.Serialization;
using System.Linq;
public class TheScraper
{
    public List<Listing> DoTheScrape(int pageNumber)
    {
        List<Listing> result = new List<Listing>();

        string url = "http://www.scoot.co.uk/find/" + "cafe" + " " + "-in-uk?page=" + pageNumber;

        var Webget = new HtmlWeb();
        var doc = Webget.Load(url);

        // select the top-level nodes; this is the closest we can get to a common parent element that all the listings are children of.
        var nodes = doc.DocumentNode.SelectNodes("//*[@id='list']/div/div/div/div");

        // loop through each child 
        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                Listing listing = new Listing();

                // get each individual listing and manually check for nulls
                // listing.Name = node.SelectSingleNode("./div/div/div/div/h2/a")?.InnerText; --easier way to null check if you can use null propagating operator
                var nameNode = node.SelectSingleNode("./div/div/div/div/h2/a");
                if (nameNode != null) listing.Name = nameNode.InnerText;

                var addressNode = node.SelectSingleNode("./div/div/div/div/p[@class='result-address']");
                if (addressNode != null) listing.Address = addressNode.InnerText.Trim();

                var numberNode = node.SelectSingleNode("./div/div/div/div/p[@class='result-number']/a");
                // also guard against the attribute itself being missing
                if (numberNode != null && numberNode.Attributes["data-visible-number"] != null)
                    listing.Number = numberNode.Attributes["data-visible-number"].Value;

                result.Add(listing);
            }
        }

        // filter out the nulls
        result = result.Where(x => x.Name != null && x.Address != null && x.Number != null).ToList();

        return result;
    }

    public string SerializeTheListings(List<Listing> listings)
    {
        var xmlSerializer = new XmlSerializer(typeof(List<Listing>));

        using (var stringWriter = new StringWriter())
        using (var xmlWriter = XmlWriter.Create(stringWriter, new XmlWriterSettings { Indent = true }))
        {
            xmlSerializer.Serialize(xmlWriter, listings);
            return stringWriter.ToString();
        }
    }
}

Then your code behind would look something like this, plus references to the scraper class and model class:

public partial class _Default : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        TheScraper scraper = new TheScraper();
        List<Listing> listings = new List<Listing>();
        // Quick hack to scrape the first 5 pages (assuming pages are numbered from 1, as in
        // the page=4 URL above). If this runs frequently you'd want to detect how many pages
        // there are, or start at page one and follow the site's next-page link.
        for (int i = 1; i <= 5; i++)
        {
            listings = listings.Union(scraper.DoTheScrape(i)).ToList();
        }            
        string xmlListings = scraper.SerializeTheListings(listings);
    }
}
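
If you still want the page itself to return the XML, as in your original Page_Load, you can write the serialized string to the response the same way you already were (a sketch reusing your original response settings at the end of Page_Load):

        Response.ContentType = "text/xml";
        Response.ContentEncoding = System.Text.Encoding.UTF8;
        Response.Write(xmlListings);
        Response.End(); // optional: stops further page rendering so only the XML is sent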
  • Thanks very much for your answer, although I do get: Severity Code Description Project File Line Suppression State Error CS8026 Feature 'null propagating operator' is not available in C# 5. Please use language version 6 or greater. website(2) C:\Users\g.smith\valuationapplication\Gatwick Web Scrape\website\gatwickxml.aspx.cs 91 Active – Gaz Smith Aug 27 '16 at 02:16
  • You can replace the "?" with manual null checks; I think that's what it's referring to. Really dirty example: node.SelectSingleNode("./div/div/div/div/h2/a") != null ? node.SelectSingleNode("./div/div/div/div/h2/a").InnerText : null; (see the helper sketch after these comments) – Joe_DM Aug 27 '16 at 02:30
  • Also see http://stackoverflow.com/questions/27968963/c-sharp-6-0-features-not-working-with-visual-studio-2015 – Joe_DM Aug 27 '16 at 02:33
  • One last thing, @Joe_DM: is there a way to loop this so I can run it for the first 5 pages? – Gaz Smith Aug 27 '16 at 02:38
  • @pandemic I just quickly added a way to loop over the first 5 pages in the above answer; it's not tested, but it gives you the idea. – Joe_DM Aug 27 '16 at 02:44
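
For the C# 5 null checks discussed in the comments, a small hypothetical extension method (not part of the answer above) avoids evaluating the XPath twice and mirrors what the ?. operator does in C# 6:

    using HtmlAgilityPack;

    public static class HtmlNodeExtensions
    {
        // C# 5 equivalent of parent.SelectSingleNode(xpath)?.InnerText
        public static string GetInnerText(this HtmlNode parent, string xpath)
        {
            var node = parent.SelectSingleNode(xpath);
            return node != null ? node.InnerText : null;
        }
    }

With that in place, the name lookup in TheScraper becomes listing.Name = node.GetInnerText("./div/div/div/div/h2/a");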