
I have this simple XML:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE input[
<!ELEMENT input (#PCDATA)>
<!ELEMENT file (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT type (#PCDATA)>
]>
<input>
This is the content <file><name>test.png</name><type>Image</type></file>
</input>

I expect this to be valid but some online validators report that it is invalid because the input and file elements contain non-text nodes.

If I remove the file element within the input element then the resulting XML is reported to be valid, so I expect the "non-text nodes" are the child elements (file in input and name and type in file).

I expect this to be valid because the XML specification for an element specifies that an element is valid if it matches one of a set of conditions, one of which is:

The declaration matches Mixed, and the content (after replacing any entity references with their replacement text) consists of character data (including CDATA sections), comments, PIs and child elements whose types match names in the content model.

Note the "and child elements..." towards the end of that.

And the production for mixed is:

    Mixed      ::=      '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*'  
            | '(' S? '#PCDATA' S? ')' 

The second case is what I have for input and file: (#PCDATA)

The validity requirement for mixed content is that there can be child elements as long as their names match names in the content model, which they do.

Am I misunderstanding the specification or are these validators incorrect?

If I remove the declarations of the file, name and type elements from the DTD but leave the child elements in the content of the input element, then I get additional validation errors indicating no declaration of those types. I expect these errors because the validation requirement is that the child element names match names in the content model and, with those declarations removed, they don't match names in the content model.

But there are other validators that report the XML is valid even without the declarations of the file, name and type elements in the DTD. This too seems to be a fault of the validators as the validation requirement clearly says that the child element names must match names in the content model, which they don't, when those element declarations are removed.

I know there are various XML validation implementations and they do not all work the same so they cannot all be strictly correct. I am most interested in having a strictly correct understanding of the specification.

In strict conformance to the validity requirements of an element with content (#PCDATA):

  1. Can the content of that element include child elements?
  2. If so, must the names of those elements match names of elements in the DTD?

The specification only appears to require that the names of child elements match names of elements in the DTD but I think reasonably the content and attributes of such elements should also match the declarations in the DTD, but the specification doesn't actually say this. So, again, in strict conformance with the validity requirements of the specification, must the content and attributes of a child element of an element with content (#PCDATA) match the declarations of these in the DTD? If so, where in the specification does it say so?

Finally, is there any easy to use (online or installable to Linux) XML validator that is strictly correct according to the specification that you can recommend?

Ian

1 Answer


Your element declaration,

<!ELEMENT input (#PCDATA)>

technically qualifies as allowing mixed content, but does not allow any elements to be mixed in.

The section you cite says that mixed content may contain character data, optionally interspersed with child elements. This is supported by the production in that section. The part marked with ^^^ below is what allows elements to be mixed in, and only those elements that are listed by Name:

Mixed ::= '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*'  
                           ^^^^^^^^^^^^^^^^^       
        | '(' S? '#PCDATA' S? ')' 

However, your declaration does not actually allow elements. If you wish elements such as file to be allowed to be mixed in, instead declare input like this:

<!ELEMENT input (#PCDATA|file)*>
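
As a minimal sketch of how the whole document might look with that change applied: the document from the question is reused as-is, and the `(name, type)` content model for file is my assumption, needed because file contains child elements too and so cannot stay `(#PCDATA)`.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE input [
<!-- input: character data optionally interspersed with file elements -->
<!ELEMENT input (#PCDATA|file)*>
<!-- file: element content only (assumed model: one name, then one type) -->
<!ELEMENT file (name, type)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT type (#PCDATA)>
]>
<input>
This is the content <file><name>test.png</name><type>Image</type></file>
</input>
```

With these declarations, every child element that appears is named in its parent's content model, which is what the validity constraint you quoted requires.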

Update to address follow-up comments

Any & and < characters that appear in parsed character data are parsed, that is, interpreted as the start of markup. Rules of well-formedness apply, and during validation the parsed markup must follow the grammar rules given by the schema. An element with only #PCDATA in its content model does not implicitly allow interspersed elements that aren't mentioned in the content model.
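
To illustrate, here is a small sketch (my own example, not taken from the spec) of markup that an element declared `(#PCDATA)` can still contain while remaining valid: references, a CDATA section, a comment and a processing instruction, but no child elements. The processing instruction target `example` is hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE input [
<!ELEMENT input (#PCDATA)>
]>
<!-- The & and < below are parsed as markup: an entity reference, a character
     reference, a CDATA section, a comment and a processing instruction.
     None of them is a child element, so the (#PCDATA) model is satisfied. -->
<input>Text with &amp; and &#60;, <![CDATA[<file> as literal text]]>, <!-- a comment --> and <?example data?></input>
```

Replacing the CDATA section with a real `<file>…</file>` element would keep the document well-formed but make it invalid, since file is not named in the content model of input.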

Colloquially, mixed content typically implies the presence of interspersed elements; technically, mixed content may have zero or more elements [1]. Either way, the document is not valid if elements are interspersed with parsed data but not specified in the content model.


[1] Again, note the spec says optionally interspersed. Here is the full definition:

3.2.2 Mixed Content

[Definition: An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements.]

kjhughes
  • I was thinking this might be the answer. On the other hand, my declaration is like one of the explicit examples of a mixed content declaration and it's not mixed content if it doesn't allow characters with interspersed elements - the definition of mixed content. I find the specification quite ambiguous. Is there any production for the #PCDATA itself, anywhere? – Ian Oct 16 '20 at 08:13
  • Also, from section 2.2: [Definition: A parsed entity contains text, a sequence of characters, which may represent markup or character data.] And I assume #PCDATA content is a parsed entity, but maybe I'm wrong about that. If it is, then it can contain markup, which an element is. – Ian Oct 16 '20 at 08:21
  • Section 1 seems even clearer: XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Doesn't #PCDATA correspond to the case of parsed data made up of characters? – Ian Oct 16 '20 at 08:29
  • Thank you for the update. In the meantime I have been looking through the tests in the [Extensible Markup Language (XML) Conformance Test Suites](https://www.w3.org/XML/Test/). I haven't yet seen one like my example, but I haven't reviewed them all yet. Everything I have seen is consistent with your explanation so, while the specification still seems ambiguous to me, I suspect what you say is the usual interpretation. – Ian Oct 17 '20 at 09:29