I have this simple XML:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE input[
<!ELEMENT input (#PCDATA)>
<!ELEMENT file (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT type (#PCDATA)>
]>
<input>
This is the content <file><name>test.png</name><type>Image</type></file>
</input>
I expect this to be valid but some online validators report that it is invalid because the input and file elements contain non-text nodes.
If I remove the file element from within the input element, the resulting XML is reported as valid, so I assume the "non-text nodes" are the child elements (file in input, and name and type in file).
I expect this to be valid because the "Element Valid" validity constraint in the XML specification says that an element is valid if it matches one of a set of conditions, one of which is:
The declaration matches Mixed, and the content (after replacing any entity references with their replacement text) consists of character data (including CDATA sections), comments, PIs and child elements whose types match names in the content model.
Note the "and child elements..." towards the end of that.
And the production for mixed is:
Mixed ::= '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*'
        | '(' S? '#PCDATA' S? ')'
The second case is what I have for input and file: (#PCDATA)
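To double-check which alternative each declaration falls under, here is a minimal Python sketch that transcribes the two alternatives of the Mixed production into regular expressions. The function names classify_mixed and names_in_content_model are my own, and Name is simplified to an ASCII-only subset, so this approximates the grammar and is not a conforming parser.

```python
import re

S_OPT = r"[ \t\r\n]*"                  # S? in the spec: optional white space
NAME = r"[A-Za-z_:][A-Za-z0-9._:\-]*"  # assumption: ASCII-only subset of Name

# First alternative:  '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*'
FIRST_ALT = re.compile(
    r"\(" + S_OPT + "#PCDATA"
    + "(?:" + S_OPT + r"\|" + S_OPT + NAME + ")*"
    + S_OPT + r"\)\*"
)
# Second alternative: '(' S? '#PCDATA' S? ')'
SECOND_ALT = re.compile(r"\(" + S_OPT + "#PCDATA" + S_OPT + r"\)")


def classify_mixed(content_spec):
    """Report which alternative of Mixed the content spec matches, if any."""
    if FIRST_ALT.fullmatch(content_spec):
        return "first alternative"
    if SECOND_ALT.fullmatch(content_spec):
        return "second alternative"
    return "not Mixed"


def names_in_content_model(content_spec):
    """Return the element names written literally inside a Mixed content spec."""
    if classify_mixed(content_spec) == "not Mixed":
        raise ValueError("content spec does not match Mixed")
    # Each name in a Mixed declaration follows a '|' separator.
    return re.findall(r"\|" + S_OPT + "(" + NAME + ")", content_spec)


if __name__ == "__main__":
    for spec in ["(#PCDATA)", "(#PCDATA | file)*", "(file, name)"]:
        print(spec, "->", classify_mixed(spec))
```

With my DTD, every content spec is (#PCDATA), which this sketch classifies under the second alternative, and names_in_content_model returns an empty list for it. A declaration such as (#PCDATA | file)* falls under the first alternative and lists file.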
The validity requirement for mixed content is that there can be child elements as long as their names match names in the content model, which they do.
Am I misunderstanding the specification or are these validators incorrect?
If I remove the declarations of the file, name and type elements from the DTD but leave the child elements in the content of the input element, then I get additional validation errors indicating no declaration of those types. I expect these errors because the validation requirement is that the child element names match names in the content model and, with those declarations removed, they don't match names in the content model.
But there are other validators that report the XML is valid even without the declarations of the file, name and type elements in the DTD. This too seems to be a fault of the validators as the validation requirement clearly says that the child element names must match names in the content model, which they don't, when those element declarations are removed.
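To see the "no declaration" situation independently of any particular validator, here is a minimal sketch using Python's xml.parsers.expat. Expat is a non-validating parser, but it reports the element declarations it reads from the internal DTD subset, so the set of element types used in the document can be compared with the set declared. The helper name undeclared_elements is my own, and this checks only that each used element type has a declaration, not that its content matches the declaration.

```python
import xml.parsers.expat

DTD_DECLS = b"""<!ELEMENT file (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT type (#PCDATA)>
"""

TEMPLATE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE input[
<!ELEMENT input (#PCDATA)>
%s]>
<input>
This is the content <file><name>test.png</name><type>Image</type></file>
</input>
"""

WITH_DECLS = TEMPLATE % DTD_DECLS   # the document as posted
WITHOUT_DECLS = TEMPLATE % b""      # file, name and type declarations removed


def undeclared_elements(xml_bytes):
    """Return the element types used in the document but not declared in its DTD."""
    declared = set()
    used = set()
    parser = xml.parsers.expat.ParserCreate()
    # Called once per <!ELEMENT ...> declaration in the DTD.
    parser.ElementDeclHandler = lambda name, model: declared.add(name)
    # Called once per start tag in the document body.
    parser.StartElementHandler = lambda name, attrs: used.add(name)
    parser.Parse(xml_bytes, True)
    return used - declared


if __name__ == "__main__":
    print(sorted(undeclared_elements(WITH_DECLS)))
    print(sorted(undeclared_elements(WITHOUT_DECLS)))
```

For the document as posted, the difference is empty. With the three declarations removed, file, name and type are the undeclared element types, which corresponds to the additional errors some validators report.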
I know there are various XML validation implementations and they do not all work the same so they cannot all be strictly correct. I am most interested in having a strictly correct understanding of the specification.
In strict conformance to the validity requirements of an element with content (#PCDATA):
- Can the content of that element include child elements?
- If so, must the names of those elements match names of elements in the DTD?
The specification appears to require only that the names of child elements match names of elements in the DTD. I would reasonably expect the content and attributes of such elements to match their declarations in the DTD as well, but the specification doesn't appear to say this. So, again, in strict conformance with the validity requirements of the specification, must the content and attributes of a child element of an element with content (#PCDATA) match their declarations in the DTD? If so, where in the specification does it say so?
Finally, can you recommend an easy-to-use XML validator (online or installable on Linux) that is strictly correct according to the specification?