I needed to transform the contents of an HTML web page using XSLT, so I used SgmlReader and wrote the snippet shown below (I figured that, in the end, it's an XmlReader too...).

XmlReader xslr = XmlReader.Create(new StringReader(
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
    "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">" +
    "<xsl:output method=\"xml\" encoding=\"UTF-8\" version=\"1.0\" />" +
    "<xsl:template match=\"/\">" +
    "<XXX xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"><xsl:value-of select=\"count(//br)\" /></XXX>" +
    "</xsl:template>" +
    "</xsl:stylesheet>"));

XslCompiledTransform xslt = new XslCompiledTransform();
xslt.Load(xslr);

using (SgmlReader html = new SgmlReader())
{
    StringBuilder sb = new StringBuilder();
    using (TextWriter sw = new StringWriter(sb))
    using (XmlWriter xw = new XmlTextWriter(sw))
    {
        html.InputStream = new StringReader(Resources.html_orig);
        html.DocType = "HTML";

        try
        {
            xslt.Transform(html, xw);
            string output = sb.ToString();
            System.Console.WriteLine(output);
        }
        catch (Exception exc)
        {
            System.Console.WriteLine("{0} : {1}", exc.GetType().Name, exc.Message);
            System.Console.WriteLine(exc.StackTrace);
        }
    }
}

Nonetheless, I get this error message:

NullReferenceException : Object reference not set to an instance of an object.
   at MS.Internal.Xml.Cache.XPathDocumentBuilder.Initialize(XPathDocument doc, IXmlLineInfo lineInfo, String baseUri, LoadFlags flags)
   at MS.Internal.Xml.Cache.XPathDocumentBuilder..ctor(XPathDocument doc, IXmlLineInfo lineInfo, String baseUri, LoadFlags flags)
   at System.Xml.XPath.XPathDocument.LoadFromReader(XmlReader reader, XmlSpace space)
   at System.Xml.XPath.XPathDocument..ctor(XmlReader reader, XmlSpace space)
   at System.Xml.Xsl.Runtime.XmlQueryContext.ConstructDocument(Object dataSource, String uriRelative, Uri uriResolved)
   at System.Xml.Xsl.Runtime.XmlQueryContext..ctor(XmlQueryRuntime runtime, Object defaultDataSource, XmlResolver dataSources, XsltArgumentList argList, WhitespaceRuleLookup wsRules)
   at System.Xml.Xsl.Runtime.XmlQueryRuntime..ctor(XmlQueryStaticData data, Object defaultDataSource, XmlResolver dataSources, XsltArgumentList argList, XmlSequenceWriter seqWrt)
   at System.Xml.Xsl.XmlILCommand.Execute(Object defaultDocument, XmlResolver dataSources, XsltArgumentList argumentList, XmlSequenceWriter results)
   at System.Xml.Xsl.XmlILCommand.Execute(Object defaultDocument, XmlResolver dataSources, XsltArgumentList argumentList, XmlWriter writer, Boolean closeWriter)
   at System.Xml.Xsl.XmlILCommand.Execute(XmlReader contextDocument, XmlResolver dataSources, XsltArgumentList argumentList, XmlWriter results)
   at System.Xml.Xsl.XslCompiledTransform.Transform(XmlReader input, XmlWriter results)

I found a way to work around this by converting the HTML to XML first and then applying the transform, but that's an inefficient solution because:

  1. The intermediate XHTML output goes to a buffer, so extra memory is needed.
  2. The conversion needs extra CPU time, and the same hierarchy is traversed twice (in theory unnecessarily).
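
For reference, the two-pass workaround described above could look roughly like the sketch below. The names `TwoPassTransform`, `TransformViaBuffer` and `RunDemo` are made up for illustration, and the demo feeds a plain `XmlReader` over a small well-formed sample so that it runs without any third-party library; in the scenario from the question, the `source` argument would be the configured SgmlReader instead.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

public static class TwoPassTransform
{
    // Hypothetical helper: buffers what 'source' yields as XML text,
    // then runs the stylesheet over that buffer.
    public static string TransformViaBuffer(XmlReader source, XslCompiledTransform xslt)
    {
        // Pass 1: copy the root element and its subtree into a buffer
        // (this is the extra memory from point 1).
        StringBuilder buffer = new StringBuilder();
        XmlWriterSettings copySettings = new XmlWriterSettings();
        copySettings.OmitXmlDeclaration = true;
        using (XmlWriter copy = XmlWriter.Create(buffer, copySettings))
        {
            source.MoveToContent();       // skip declaration, DOCTYPE, whitespace
            copy.WriteNode(source, true); // copy the element the reader is on
        }

        // Pass 2: parse the buffer again and transform it
        // (this is the second traversal from point 2).
        StringBuilder result = new StringBuilder();
        using (XmlReader xml = XmlReader.Create(new StringReader(buffer.ToString())))
        using (XmlWriter output = XmlWriter.Create(result, xslt.OutputSettings))
        {
            xslt.Transform(xml, output);
        }
        return result.ToString();
    }

    // Demo with an assumed sample document; the stylesheet mirrors the one above.
    public static string RunDemo()
    {
        const string stylesheet =
            @"<xsl:stylesheet xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" version=""1.0"">
                <xsl:output method=""xml"" encoding=""UTF-8"" />
                <xsl:template match=""/"">
                  <XXX><xsl:value-of select=""count(//br)"" /></XXX>
                </xsl:template>
              </xsl:stylesheet>";
        const string sample = "<html><body><p>a<br/>b<br/></p></body></html>";

        XslCompiledTransform xslt = new XslCompiledTransform();
        using (XmlReader xslReader = XmlReader.Create(new StringReader(stylesheet)))
        {
            xslt.Load(xslReader);
        }
        using (XmlReader source = XmlReader.Create(new StringReader(sample)))
        {
            return TransformViaBuffer(source, xslt);
        }
    }

    public static void Main()
    {
        Console.WriteLine(RunDemo());
    }
}
```

Both costs listed above are visible here: `buffer` holds a full serialized copy of the input, and the second `XmlReader` walks the same tree again.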

So (since I know the StackOverflow community always provides great answers, whereas other C# forums have completely disappointed me ;o) I'm looking for feedback and suggestions on how to perform XSL transformations on HTML directly (even if SgmlReader needs to be replaced by another similar library).

Nimantha
Olemis Lang
  • About the underlying question: XSLT 1.0 works with an XML input tree (XSLT 2.0 can use unparsed resources). If you have something that is not an XML tree, then you need some method for mapping it to an XML tree. –  Nov 30 '10 at 15:27
  • Olemis, just a note: XslCompiledTransform is an XSLT 1.0 processor, so if you use version="2.0" in your stylesheet it runs in forwards-compatible mode and you will not get all XSLT 1.0 syntax errors reported. So I would start by setting version="1.0" in your stylesheet, as then XslCompiledTransform will already inform you on the Load call that your stylesheet is syntactically incorrect, as an xsl:output inside of an xsl:template is not allowed. Whether that helps with your problem of feeding an SgmlReader I am not sure; you will need to provide a sample of the HTML that gives the exception. – Martin Honnen Nov 30 '10 at 16:03

2 Answers

Even though the SgmlReader class extends the XmlReader class, that doesn't mean it also behaves like an XmlReader.

Technically it also does not make sense that SgmlReader is a subclass of XmlReader, simply because SGML is a superset of XML and not a subset.

You didn't write about the purpose of your transformation, but in general HTML Agility Pack is a good option for manipulating HTML.

Dirk Vollmar
  • With all due respect, this **really** makes sense (at least from an *OOP* perspective): because *XmlReader* is a type, anything implementing this interface may be manipulated (in this case read) **as if it were** an *XML* document. Indeed, it makes sense to implement *XmlReader*s for other structured formats like *YAML*, *INI* files, ... Even if they are not markup at all, they are structured documents you might want to read and transform in a structured way. Just my opinion. – Olemis Lang Dec 01 '10 at 21:04
  • @Olemis Lang: In my opinion it does not make sense because the XmlReader expects a well-formed document, i.e. a document with a tree structure. SGML does not provide for that, so methods such as `ReadSubtree` or `ReadInnerXml` do not make any sense. So in the case of running an XSLT on an SgmlReader you might actually run into the case that the underlying engine calls one of these methods but doesn't get what it expects. Also see Alejandro's comment on what is expected by XSLT. – Dirk Vollmar Dec 01 '10 at 21:45
  • & @Alejandro: I thought that was what *SgmlReader* was for (i.e. treating malformed HTML just as if it were its *XHTML* equivalent, in the specific case of *HTML*). In fact, if you take a look at the traceback, it seems the reader is used internally to build an instance of *System.Xml.XPath.XPathDocument*, which is what the compiled *XSL* transform uses under the hood. Anyway, I'll try these methods in a while just to confirm what happens with *ReadSubtree* et al. Thanks – Olemis Lang Dec 02 '10 at 14:12

Have you tried using the HTML Agility Pack instead of SgmlReader? You can load the HTML into it and run a transform against it directly. I'm not positive whether an XML document is created internally, though. It seems that one is not, but you would probably want to compare memory and CPU usage against the conversion method you tried and discarded.

// You already have your XSLT loaded into the variable xslt...

HtmlDocument doc = new HtmlDocument();
doc.Load( ... );  // load your HTML doc, or use LoadHtml to load from a string, etc.
xslt.Transform(doc, xw);

See also this question: How to use HTML Agility pack

Philip Rieck
  • Thanks Philip for your reply, but building the *HTML DOM* may be time consuming and extra memory will be used as well. I'd really like to avoid loading objects into memory and extra processing, because the application should run on devices with limited capabilities. That's why I was looking for a way to feed an *HTML* *XmlReader* directly to the *XSLT* (but, considering the traceback, it seems it builds a *System.Xml.XPath.XPathDocument* internally, so maybe any optimizations I can imagine are just a waste of time ...) – Olemis Lang Dec 01 '10 at 21:14
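
The stack trace in the question suggests that `XslCompiledTransform.Transform(XmlReader, XmlWriter)` loads the whole input into an `XPathDocument` before running the stylesheet. Under that assumption, a minimal sketch of doing the same step explicitly is shown below; it does not avoid the in-memory tree, but it makes the cost visible and lets one cached document serve several transforms. The names `CachedInputTransform`, `Apply` and `RunDemo` and the sample markup are illustrative only.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.XPath;
using System.Xml.Xsl;

public static class CachedInputTransform
{
    // Hypothetical helper: runs a compiled stylesheet over an
    // already-loaded, read-only input tree.
    public static string Apply(IXPathNavigable input, XslCompiledTransform xslt)
    {
        StringBuilder result = new StringBuilder();
        using (XmlWriter output = XmlWriter.Create(result, xslt.OutputSettings))
        {
            xslt.Transform(input, output);
        }
        return result.ToString();
    }

    public static string RunDemo()
    {
        const string stylesheet =
            @"<xsl:stylesheet xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" version=""1.0"">
                <xsl:output method=""xml"" encoding=""UTF-8"" />
                <xsl:template match=""/"">
                  <XXX><xsl:value-of select=""count(//br)"" /></XXX>
                </xsl:template>
              </xsl:stylesheet>";
        // Assumed sample input; a reader over real HTML would go here instead.
        const string sample = "<html><body><p>a<br/>b<br/></p></body></html>";

        XslCompiledTransform xslt = new XslCompiledTransform();
        using (XmlReader xslReader = XmlReader.Create(new StringReader(stylesheet)))
        {
            xslt.Load(xslReader);
        }

        // Build the XPath cache once, explicitly, from any XmlReader.
        XPathDocument cached;
        using (XmlReader source = XmlReader.Create(new StringReader(sample)))
        {
            cached = new XPathDocument(source);
        }

        // The same cached tree can be transformed as often as needed.
        return Apply(cached, xslt);
    }

    public static void Main()
    {
        Console.WriteLine(RunDemo());
    }
}
```

If several stylesheets have to run over the same page, this at least pays the parsing and tree-building cost only once.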