Is there something I need to configure in XmlReaderSettings to get .NET (4.8, 6, 7) to handle some cXML without throwing the following exception?

Unhandled exception. System.Xml.Schema.XmlSchemaException: The parameter entity replacement text must nest properly within markup declarations.

Sample cXML input

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE cXML SYSTEM "http://xml.cxml.org/schemas/cXML/1.2.041/cXML.dtd">
<cXML payloadID="donkeys@example.com" timestamp="2023-02-13T01:01:01Z">
  <Header>
  </Header>
  <Request deploymentMode="production">
  </Request>
</cXML>

Sample Application

using System.Xml;
using System.Xml.Linq;

namespace Donkeys
{
    internal class Program
    {
        static void Main()
        {
            XmlReaderSettings settings = new()
            {
                XmlResolver = new XmlUrlResolver(),
                DtdProcessing = DtdProcessing.Parse,
                ValidationType = ValidationType.DTD,
            };

            FileStream fs = File.OpenRead("test.xml"); // sample cXML from question
            XmlReader reader = XmlReader.Create(fs, settings);

            XDocument.Load(reader); // this blows up
        }
    }
}

I'm looking to use the XmlUrlResolver to cache the DTDs, but unless I ignore validation I get the error above, and I'm not really sure why.

So far I've tried different validation flags, but nothing is validated at all unless I use ValidationType.DTD, which throws the exception above.

The actual resolver seems to work fine; if I subclass it, it is returning the DTD (as a MemoryStream) as expected.

I can add an event handler to ignore the issue but this feels lamer than I'd like.

using System.Xml;
using System.Xml.Linq;

namespace Donkeys
{
    internal class Program
    {
        static void Main()
        {
            XmlReaderSettings settings = new()
            {
                XmlResolver = new XmlUrlResolver(),
                DtdProcessing = DtdProcessing.Parse,
                ValidationType = ValidationType.DTD,
                IgnoreComments = true
            };

            settings.ValidationEventHandler += Settings_ValidationEventHandler;

            FileStream fs = File.OpenRead("test.xml");
            XmlReader reader = XmlReader.Create(fs, settings);

            XDocument dogs = XDocument.Load(reader);
        }

        private static void Settings_ValidationEventHandler(object? sender, System.Xml.Schema.ValidationEventArgs e)
        {
            // this seems fragile
            if (e.Message.ToLower() == "The parameter entity replacement text must nest properly within markup declarations.".ToLower()) // and this would be a const
                return;

            throw e.Exception;
        }
    }
}
tobyd

1 Answer

I've spent some time over the last few days looking into this and trying to get my head around what's going on here.

As far as I can tell, the error "The parameter entity replacement text must nest properly within markup declarations" is being reported incorrectly. My understanding of the spec is that this message means the replacement text of a parameter entity in a DTD contains mismatched < and > characters.

The following example is taken from this O'Reilly book sample page and demonstrates something that genuinely should reproduce this error:

<!ENTITY % finish_it ">">
<!ENTITY % bad "won't work" %finish_it;

Indeed the .NET DTD parser reports the same error for these two lines of DTD.

This doesn't mean you can't have < and > characters in parameter entity replacement text at all: the following two lines will declare an empty element with name Z, albeit in a somewhat round-about way:

<!ENTITY % Nested "<!ELEMENT Z EMPTY>">
%Nested;

The .NET DTD parser parses this successfully.

However, the .NET DTD parser appears to be objecting to this line in the cXML DTD, which defines the Object.ANY parameter entity:

<!ENTITY % Object.ANY '|xades:QualifyingProperties|cXMLSignedInfo|Extrinsic'>

There are of course no < and > characters in the replacement text, so the error is baffling.

This is by no means a new problem. I found this unanswered Stack Overflow question, which reports essentially the same problem, and this MSDN Forum post from 2007, which does too. So is this unclear but intentional behaviour, or a bug that has been in .NET for 15+ years? I don't know.

For those who do want to look into things further, the following is about the minimum necessary to reproduce the problem. The C# code needed to read the XML file can be taken from the question and adapted, so I don't see the need to repeat it here:

example.dtd:

<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT A EMPTY>
<!ENTITY % Rest '|A' >
<!ELEMENT example (#PCDATA %Rest;)*>

example.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE example SYSTEM "example.dtd">
<example/>

There are various ways to tweak this to get rid of the error. One way is to move the | character from the parameter entity into the ELEMENT example declaration. Replacing #PCDATA with another element (which you would also have to define) is another way.
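
To illustrate the first tweak, here is a sketch of example.dtd with the | character moved out of the parameter entity and into the ELEMENT example declaration. Going by the behaviour described above, this variant should avoid the error, but treat it as an illustration rather than something verified against every .NET version:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT A EMPTY>
<!-- The '|' separator now sits in the element declaration, not in the entity. -->
<!ENTITY % Rest 'A' >
<!ELEMENT example (#PCDATA | %Rest;)*>
```

The content model that results is the same mixed content as before, (#PCDATA | A)*, so documents that were valid against the original DTD stay valid against this one.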


But enough of the theory behind the problem. How can you actually move forwards with this?

I would take a local copy of the cXML DTD and adjust it to work around this error. You can download the DTD from the URL in your sample cXML input. The %Object.ANY; parameter entity is only used once in the DTD: I would replace this one occurrence with the replacement text, |xades:QualifyingProperties|cXMLSignedInfo|Extrinsic.

You then need to adjust the .NET XML parser to use your modified copy of the cXML DTD instead of fetching the one from the given URL. You can create a custom URL resolver for this, for example:

using System;
using System.IO;
using System.Xml;

namespace Donkeys
{
    internal class CXmlUrlResolver : XmlResolver
    {
        private static readonly Uri CXml1_2_041 = new Uri("http://xml.cxml.org/schemas/cXML/1.2.041/cXML.dtd");

        private readonly XmlResolver urlResolver;

        public CXmlUrlResolver()
        {
            this.urlResolver = new XmlUrlResolver();
        }

        public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
        {
            if (absoluteUri == CXml1_2_041)
            {
                // Return a Stream that reads from your custom version of the DTD,
                // for example:
                return File.OpenRead(@"SomeFilePathHere\cXML-1.2.041.dtd");
            }

            return this.urlResolver.GetEntity(absoluteUri, role, ofObjectToReturn);
        }
    }
}

This checks which URI is being requested and, if it matches the cXML URI, returns a stream that reads from your customised copy of the DTD. If some other URI is given, it passes the request to the nested XmlUrlResolver, which then deals with it. You will of course need to use an instance of CXmlUrlResolver instead of XmlUrlResolver when creating your XmlReaderSettings.

I don't know how many versions of cXML you will have to deal with, but if you are dealing with multiple versions, you might have to create a custom copy of the DTD for each version, and have your resolver return the correct local copy for each different URI.
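
If several versions do need handling, one possible shape is a resolver that looks the requested URI up in a table of local copies. The following is only a sketch under some assumptions: the class name LocalDtdResolver, the dtds folder and the local file name are all hypothetical, and only the 1.2.041 URI from the question is listed. It also shows the resolver being plugged into XmlReaderSettings in place of XmlUrlResolver:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

namespace Donkeys
{
    // Sketch: serves patched local DTD copies for known URIs and
    // defers everything else to a normal XmlUrlResolver.
    internal sealed class LocalDtdResolver : XmlResolver
    {
        // Hypothetical layout: patched DTDs live in a "dtds" folder next to the executable.
        // Add one entry per cXML version you have patched.
        private readonly Dictionary<Uri, string> localCopies = new Dictionary<Uri, string>
        {
            [new Uri("http://xml.cxml.org/schemas/cXML/1.2.041/cXML.dtd")] =
                Path.Combine(AppContext.BaseDirectory, "dtds", "cXML-1.2.041.dtd"),
        };

        private readonly XmlResolver fallback = new XmlUrlResolver();

        public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
        {
            if (this.localCopies.TryGetValue(absoluteUri, out var path))
            {
                // Known version: read the patched copy from disk.
                return File.OpenRead(path);
            }

            // Unknown URI: let the default resolver fetch it.
            return this.fallback.GetEntity(absoluteUri, role, ofObjectToReturn);
        }
    }

    internal static class Program
    {
        private static void Main()
        {
            var settings = new XmlReaderSettings
            {
                XmlResolver = new LocalDtdResolver(),
                DtdProcessing = DtdProcessing.Parse,
                ValidationType = ValidationType.DTD,
            };

            using (FileStream fs = File.OpenRead("test.xml"))
            using (XmlReader reader = XmlReader.Create(fs, settings))
            {
                // Genuine DTD validation errors will still throw here,
                // because no ValidationEventHandler is registered.
                XDocument document = XDocument.Load(reader);
                Console.WriteLine(document.Root?.Name);
            }
        }
    }
}
```

The code deliberately avoids nullable annotations so that it compiles on .NET Framework 4.8 as well; on .NET 6 or 7 with nullable reference types enabled, the GetEntity override may produce nullability warnings, which can be addressed by matching the base signature.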

A similar approach is given at this MSDN Forums post from 2008, which also deals with difficulties parsing cXML with .NET. This features a custom URL resolver created by subclassing XmlUrlResolver. Those who prefer composition over inheritance may prefer my custom URL resolver instead.

Luke Woodward
  • Thanks for the comprehensive investigation - some interesting stuff going on in there. I did find various duplicates* of this question floating about in various places going back several years but no definitive fix. I don't have much control over the cXML versioning I have to deal with so keeping a fixed version of every permutation might not be doable but I can certainly use the same general idea and try and correct the DTD on-the-fly and cache a local copy for versions I've not seen yet. – tobyd Feb 19 '23 at 20:44