5

What is the best way to have HtmlAgilityPack read an XML file that references an XSL stylesheet used to render HTML? Are there any settings on the HtmlDocument class that would help, or do I have to run the transformation myself before loading the result with HtmlAgilityPack? If the latter, does anybody know of a good library or method for such a transformation? Below is an example of a website that returns XML with an XSL file, along with the code that I would like to use.

var uri = new Uri("http://www.skechers.com/");
var request = (HttpWebRequest)WebRequest.Create(uri);
var cookieContainer = new CookieContainer();

request.CookieContainer = cookieContainer;
request.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5";
request.Method = "GET";
request.AllowAutoRedirect = true;
request.Timeout = 15000;

var response = (HttpWebResponse)request.GetResponse();
var page = new HtmlDocument();
page.OptionReadEncoding = false;
var stream = response.GetResponseStream();
page.Load(stream); 

This code does not throw any errors, but it parses the raw XML rather than the HTML produced by the transformation, which is what I want.

Adrian Adkison
  • If you have well-formed XML, why use the HtmlAgilityPack at all? – Cameron Mar 21 '11 at 23:50
  • I am trying to get a page summary, i.e. the page title, the meta description, and a list of the img srcs on the page. I am allowing input of any valid URL from the web. So to answer your question, I don't always have well-formed XML, and even if I did, the document title and description would be formatted inconsistently. – Adrian Adkison Mar 21 '11 at 23:57

3 Answers

3

Html Agility Pack can help you here on two points:

1) It makes the XML processing instruction easier to read: it parses the PI data as HTML, so the PI's pseudo-attributes (such as href) become regular attributes.

2) HtmlDocument implements IXPathNavigable so it can be transformed directly by the .NET Xslt transformation engine.

Here is a piece of code that works. I had to add a custom XmlResolver to handle the XSLT transform properly, but I think this is specific to the skechers case.

public static void DownloadAndProcessXml(string url, string userAgent, string outputFilePath)
{
    using (XmlTextWriter writer = new XmlTextWriter(outputFilePath, Encoding.UTF8))
    {
        DownloadAndProcessXml(url, userAgent, writer);
    }
}

public static void DownloadAndProcessXml(string url, string userAgent, XmlWriter output)
{
    UserAgentXmlUrlResolver resolver = new UserAgentXmlUrlResolver(url, userAgent);

    // WebClient is an easy to use class.
    using (WebClient client = new WebClient())
    {
        // download Xml doc. set User-Agent header or the site won't answer us...
        client.Headers[HttpRequestHeader.UserAgent] = resolver.UserAgent;
        HtmlDocument xmlDoc = new HtmlDocument();
        xmlDoc.Load(client.OpenRead(url));

        // determine xslt (note the xpath trick as Html Agility Pack does not support xml processing instructions)
        string xsltUrl = xmlDoc.DocumentNode.SelectSingleNode("//*[name()='?xml-stylesheet']").GetAttributeValue("href", null);

        // download Xslt doc (resolve the href against the page url rather than concatenating strings)
        client.Headers[HttpRequestHeader.UserAgent] = resolver.UserAgent;
        XslCompiledTransform xslt = new XslCompiledTransform();
        xslt.Load(new XmlTextReader(client.OpenRead(new Uri(new Uri(url), xsltUrl))), new XsltSettings(true, false), null);

        // transform Html/Xml doc into new Xml doc, easy as HtmlDocument implements IXPathNavigable
        // note the custom resolver, needed to handle the resolve requests this Xslt makes
        xslt.Transform(xmlDoc, null, output, resolver);
    }
}

// This class is needed during transformation otherwise there are errors.
// This is probably due to this very specific Xslt file that needs to go back to the root document itself.
public class UserAgentXmlUrlResolver : XmlUrlResolver
{
    public UserAgentXmlUrlResolver(string rootUrl, string userAgent)
    {
        RootUrl = rootUrl;
        UserAgent = userAgent;
    }

    public string RootUrl { get; set; }
    public string UserAgent { get; set; }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        WebClient client = new WebClient();
        if (!string.IsNullOrEmpty(UserAgent))
        {
            client.Headers[HttpRequestHeader.UserAgent] = UserAgent;
        }
        return client.OpenRead(absoluteUri);
    }

    public override Uri ResolveUri(Uri baseUri, string relativeUri)
    {
        if ((relativeUri == "/") && (!string.IsNullOrEmpty(RootUrl)))
            return new Uri(RootUrl);

        return base.ResolveUri(baseUri, relativeUri);
    }
}

And you call it like this:

    string url = "http://www.skechers.com/";
    string ua = @"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5";
    DownloadAndProcessXml(url, ua, "skechers.html");
Simon Mourier
  • Thanks again. I think for my purposes, the code that I have will work a little better, but as a general guide on how to do this I would recommend your code. Btw, HtmlAgilityPack is f*#%ing awesome. – Adrian Adkison Mar 22 '11 at 18:09
  • I would also like to add that it would be cool to be able to pass in an html string into the HtmlDocument.Load method instead of having to create a stream manually. I do see that it already has like 12 overloads! – Adrian Adkison Mar 22 '11 at 18:12
  • @Adrian Adkison - There is a LoadHtml overload for this purpose. – Simon Mourier Mar 22 '11 at 19:13
2

You should render the output of the XML and XSLT. To do this you need to download the XML, and you've already done that. Next parse the XML to identify the XSL reference. Then you need to download the XSL and apply that to the XML document.
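The steps above can be sketched roughly as follows. This is a minimal illustration, not code from any of the answers: the class and method names (XsltSketch, FindStylesheetHref, Transform) are made up for the example, and it works on in-memory strings so the download steps are left out. It assumes the stylesheet does not rely on document() or embedded scripts, which XslCompiledTransform disables by default.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

// Hypothetical helper class, for illustration only.
public static class XsltSketch
{
    // Returns the href pseudo-attribute of the first top-level
    // xml-stylesheet processing instruction, or null if there is none.
    public static string FindStylesheetHref(string xml)
    {
        var doc = new XPathDocument(new StringReader(xml));
        XPathNavigator pi = doc.CreateNavigator()
            .SelectSingleNode("/processing-instruction('xml-stylesheet')");
        if (pi == null) return null;

        // The PI data is plain text such as: type="text/xsl" href="/style.xsl"
        Match m = Regex.Match(pi.Value, "href\\s*=\\s*(\"|')(?<url>.*?)\\1");
        return m.Success ? m.Groups["url"].Value : null;
    }

    // Applies an XSLT stylesheet (given as a string) to an XML document
    // (also a string) and returns the transformed output as text.
    public static string Transform(string xml, string xsl)
    {
        var transform = new XslCompiledTransform();
        using (XmlReader xslReader = XmlReader.Create(new StringReader(xsl)))
        {
            transform.Load(xslReader);
        }

        var output = new StringWriter();
        using (XmlReader xmlReader = XmlReader.Create(new StringReader(xml)))
        {
            transform.Transform(xmlReader, null, output);
        }
        return output.ToString();
    }
}
```

In a real scraper, the string returned by FindStylesheetHref would be resolved against the page URL and downloaded before being passed to Transform, and the output could then be loaded into HtmlAgilityPack.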

These links may be useful

Brian Lyttle
0

Here is the additional code I ended up using once I received the response. Please note that this only works when the response content type is "application/xml", and you will have to add null checks throughout. Also, FormAssetSrc is a private function that takes the value of the href, determines whether it is protocol-relative, root-relative, or document-relative, and creates the fully qualified URI.

// Load the XML response and locate its xml-stylesheet processing instruction.
var xmlStream = response.GetResponseStream();
var xmlDocument = new XPathDocument(xmlStream);
var styleNode = xmlDocument.CreateNavigator().SelectSingleNode("processing-instruction('xml-stylesheet')");
// The PI data is plain text, so the href is pulled out with a regex.
var hrefValue = Regex.Match(styleNode.Value, "href=(\"|')(?<url>.*?)(\"|')");
if (hrefValue.Success)
{
    // Resolve the href against the response URI, then download the stylesheet.
    var xslHref = FormAssetSrc(hrefValue.Groups["url"].Value, response.ResponseUri);
    var xslUri = new Uri(xslHref);
    var xslRequest = CreateWebRequest(xslUri);
    var xslResponse = (HttpWebResponse)xslRequest.GetResponse();
    var xslDocument = new XPathDocument(xslResponse.GetResponseStream());
    var xslTransform = new XslTransform();
    var sw = new System.IO.StringWriter();
    xslTransform.Load(xslDocument);
    // Apply the stylesheet and load the resulting HTML for parsing.
    xslTransform.Transform(xmlDocument.CreateNavigator(), null, sw);
    page.Html.LoadHtml(sw.ToString());
}
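
FormAssetSrc itself is not shown in this answer. As a rough guess at what such a helper might look like, the sketch below relies on the System.Uri constructor that resolves a reference against a base URI, which covers the protocol-relative, root-relative, and document-relative cases. The class name AssetUrlSketch and the method body are assumptions for illustration, not the author's actual implementation.

```csharp
using System;

// Hypothetical stand-in for the private FormAssetSrc helper described above.
public static class AssetUrlSketch
{
    // Resolves href against the URI of the page it was found on and
    // returns the fully qualified URI as a string.
    public static string FormAssetSrc(string href, Uri pageUri)
    {
        // Uri(Uri, string) applies standard relative-reference resolution:
        //   "//host/x.xsl"  keeps the page's scheme (protocol-relative)
        //   "/x.xsl"        keeps the page's scheme and host (root-relative)
        //   "dir/x.xsl"     is resolved against the page's directory
        // An href that is already absolute is returned unchanged.
        return new Uri(pageUri, href).AbsoluteUri;
    }
}
```

The signature matches how the answer calls it: FormAssetSrc(hrefValue.Groups["url"].Value, response.ResponseUri).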
Adrian Adkison
  • CreateWebRequest is also a private function; it creates a request like the one in the first code snippet of the original question – Adrian Adkison Mar 22 '11 at 04:09