
I'm parsing an XML document in Python, and I have an XSD schema to validate it. Can I get the type of a particular node of my XML as it is defined in the XSD?

For example, my XML (small part) is

<deviceDescription>
  <wakeupNote>
    <lang xml:lang="ru">Русский</lang>
    <lang xml:lang="en">English</lang>
  </wakeupNote> 
</deviceDescription>

My XSD is (once again a small part of it):

<xsd:element name="deviceDescription" type="zwv:deviceDescription" minOccurs="0"/>

<xsd:complexType name="deviceDescription">
  <xsd:sequence>
    <xsd:element name="wakeupNote" type="zwv:description" minOccurs="0">
      <xsd:unique name="langDescrUnique">
        <xsd:selector xpath="zwv:lang"/> 
        <xsd:field xpath="@xml:lang"/>  
      </xsd:unique>
    </xsd:element> 
  </xsd:sequence>
</xsd:complexType>

<xsd:complexType name="description">
  <xsd:sequence>
    <xsd:element name="lang" maxOccurs="unbounded">
      <xsd:complexType>
        <xsd:simpleContent>
          <xsd:extension base="xsd:string">
            <xsd:attribute ref="xml:lang" use="required"/>
          </xsd:extension>
        </xsd:simpleContent>
      </xsd:complexType>
    </xsd:element>
  </xsd:sequence> 
</xsd:complexType>

During parsing I want to know that my wakeupNote tag is defined in the XSD as complexType zwv:description. How can I do this in Python?

What do I need this for? Suppose I have a lot of these XML files and I want to check that all of them have the English-language fields filled in. It would be easy to check whether <lang xml:lang="en"></lang> is empty, but the schema allows this tag to be omitted entirely.

So the idea is to get all tags that may have language descriptions and check that a <lang> tag for en is present and has non-empty content.

UPD

Since my XML is checked against the XSD during validation, the validation engine knows the types of all nodes. I asked a similar question 7 months ago that still has no answer. They are related, IMHO: Validating and filling default values in XML based on XSD in Python

PoltoS

2 Answers


You're right that the validator must know the type associations of all the elements and attributes it validates, and that the validator is thus in a position to provide access to that information.

For better or worse, however, both the API between caller and validator and the selection of validation-related information available to the caller are completely implementation-defined. Some validators (Xerces J is a notable example) make a very full range of validation information available; others don't.

Without knowing what validator you are using, no one can tell you with certainty whether the type information you're seeking is available. Since you're calling the validator, there must be an API; if type associations are available through the API, presumably the documentation will say so. If the API doesn't provide access to it, it may be because the underlying schema validator doesn't provide access to the information, or it may be because the creator of the API didn't see the point; your job (if you want to pursue this further) will be to find out which of those is the case and then try to persuade the relevant parties that it would be useful to make the information available.

If you have no luck with getting access to the information through the API, you can help yourself with a more sophisticated version of the approach mentioned in another answer by David W. It is a property of XSD schemas that the governing type of any element is strictly a function of the path to that element from the validation root, so it is straightforward in principle (if more than a bit tedious in practice) to identify, for any element in a document instance, what its governing type will be if the document instance is validated against a particular schema. For the case you mention, for example, it is straightforward to tell whether a given wakeupNote has deviceDescription or otherElement as an ancestor, or which is the nearer ancestor if the wakeupNote has both, and to infer the appropriate governing type definition based on that knowledge.
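The path-based inference described above can be sketched in Python with the standard library. This is a minimal sketch under loud assumptions, not a general tool: it only understands global element declarations, named complex types, and local elements inside sequence/choice/all, and it ignores element refs, groups, type extension, and substitution groups. The `urn:example:zwv` namespace, the `other` type, and the helper names `child_decls` and `governing_type` are hypothetical; `otherElement` and `zwv:shortdescription` are the illustrative names used elsewhere in this thread.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"
GROUPS = {XS + "sequence", XS + "choice", XS + "all"}

# Hypothetical schema: two wakeupNote declarations with different types.
SCHEMA = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:zwv="urn:example:zwv" targetNamespace="urn:example:zwv">
  <xsd:element name="deviceDescription" type="zwv:deviceDescription"/>
  <xsd:element name="otherElement" type="zwv:other"/>
  <xsd:complexType name="deviceDescription">
    <xsd:sequence>
      <xsd:element name="wakeupNote" type="zwv:description" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="other">
    <xsd:sequence>
      <xsd:element name="wakeupNote" type="zwv:shortdescription"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="description">
    <xsd:sequence>
      <xsd:element name="lang" maxOccurs="unbounded">
        <xsd:complexType>
          <xsd:simpleContent>
            <xsd:extension base="xsd:string"/>
          </xsd:simpleContent>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""

def child_decls(scope):
    """Yield element declarations directly in scope, looking through
    sequence/choice/all but not into nested element declarations."""
    for child in scope:
        if child.tag == XS + "element":
            yield child
        elif child.tag in GROUPS:
            yield from child_decls(child)

def governing_type(schema, path):
    """Return the declared type of the element reached by `path` (a list of
    local element names starting at the validation root), or None."""
    named = {t.get("name"): t for t in schema.findall(XS + "complexType")}
    scope, type_name = schema, None
    for name in path:
        if scope is None:  # previous step had a simple or unknown type
            return None
        decl = next((e for e in child_decls(scope) if e.get("name") == name), None)
        if decl is None:
            return None
        type_name = decl.get("type")
        if type_name is not None:
            # Strip the prefix and look up the named type definition.
            scope = named.get(type_name.split(":")[-1])
        else:
            type_name = "(anonymous)"
            scope = decl.find(XS + "complexType")
    return type_name

schema = ET.fromstring(SCHEMA)
print(governing_type(schema, ["deviceDescription", "wakeupNote"]))  # zwv:description
print(governing_type(schema, ["otherElement", "wakeupNote"]))       # zwv:shortdescription
```

The same element name resolves to different types depending on its ancestors, which is why the lookup has to follow the path from the root rather than search for the name alone.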

Helping yourself in this way is likely to require a non-trivial amount of work. It would help if there were general-purpose tools to calculate this information and make it accessible in various forms, but if there are any such, I don't know about them. (I do know people who could build such a tool for a fee.) So if I were you I'd try to get the information through the API first.

C. M. Sperberg-McQueen

If the question is "How do I find the name of the type for a given XML node?", the answer is to use XPath in Python to look it up. The XPath to run on the XSD (with the xsd prefix bound to the XML Schema namespace) is

//xsd:element[@name='wakeupNote']/@type

This should return zwv:description. If it returns two types, you'll have to walk from the root:

/root/foo/wakeupNote (type A)
/root/bar/wakeupNote (type B)

Walking down from the root will be tedious. You'll have to look for both anonymous and named types.
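As a rough illustration of the name-based lookup, here is a sketch using the standard library's `xml.etree.ElementTree`. Its XPath subset cannot select attributes directly, so the code finds the declarations and reads `type` with `get()`; with lxml, the `xpath()` method can evaluate the full expression instead. The schema fragment, the `urn:example:zwv` namespace, and the helper name `declared_types` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

NS = {"xsd": "http://www.w3.org/2001/XMLSchema"}

# Hypothetical fragment in which wakeupNote is declared twice.
SCHEMA = """
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:zwv="urn:example:zwv">
  <xsd:complexType name="deviceDescription">
    <xsd:sequence>
      <xsd:element name="wakeupNote" type="zwv:description" minOccurs="0"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="other">
    <xsd:sequence>
      <xsd:element name="wakeupNote" type="zwv:shortdescription"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""

def declared_types(schema, element_name):
    """Return the sorted, de-duplicated type names declared for element_name
    anywhere in the schema; inline types are reported as '(anonymous)'."""
    hits = schema.findall(f".//xsd:element[@name='{element_name}']", NS)
    return sorted({e.get("type", "(anonymous)") for e in hits})

schema = ET.fromstring(SCHEMA)
types = declared_types(schema, "wakeupNote")
print(types)  # ['zwv:description', 'zwv:shortdescription']
if len(types) > 1:
    print("ambiguous: the type depends on the path from the root")
```

When more than one type comes back, the name alone is not enough and the path from the root has to be taken into account.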

If the question is "How do I find all XML nodes of a given type?" and the schema will change frequently, you could test the type of every node as you parse it with the above method.

If the schema is well known and fixed, and the nodes you are looking for are findable with XPath, you could select them directly:

//*[@xml:lang='en']

Then use Python to check the length of each node's text.
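A minimal sketch of that check, using only the standard library. It does not consult the schema at all: it assumes that any element with `lang` children is a language description, and it flags those that lack a non-empty English entry. The elements `inclusionNote` and `resetNote` and the function name `missing_english` are hypothetical additions for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """
<deviceDescription>
  <wakeupNote>
    <lang xml:lang="ru">Русский</lang>
    <lang xml:lang="en">English</lang>
  </wakeupNote>
  <inclusionNote>
    <lang xml:lang="ru">Русский</lang>
  </inclusionNote>
  <resetNote>
    <lang xml:lang="en"></lang>
  </resetNote>
</deviceDescription>
"""

def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]

def missing_english(root):
    """Return the local names of elements that have <lang> children but no
    <lang xml:lang="en"> child with non-blank text."""
    problems = []
    for parent in root.iter():
        langs = [c for c in parent if local_name(c.tag) == "lang"]
        if not langs:
            continue
        has_english = any(
            c.get(XML_LANG) == "en" and (c.text or "").strip() for c in langs
        )
        if not has_english:
            problems.append(local_name(parent.tag))
    return problems

root = ET.fromstring(SAMPLE)
print(missing_english(root))  # ['inclusionNote', 'resetNote']
```

Note that this catches a missing or empty English entry, but not a description element that is absent altogether; detecting that still needs knowledge of the schema.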

In the stable-schema case, you could write a second XSD that enforces the criteria you are looking for.

David W
  • I want to find all XML nodes that are defined in the XSD as `zwv:description`, not all definitions in the XSD. For example, in my XSD I can define two `wakeupNote` elements: one inside `deviceDescription` as `zwv:description` and one inside another tag as `zwv:shortdescription`. So in my XML I'll have two types of `wakeupNote`, and I need to select only those of type `zwv:description`. How can I do that? – PoltoS Feb 03 '11 at 08:58
  • Please add a comment if the edit doesn't meet your need. Please don't down vote without giving a chance to clarify the question and the answer. – David W Feb 03 '11 at 18:25
  • The more I think about this, the more I need to understand the use case to make a good recommendation. How are you parsing the XML? How frequently does the schema change? Is the XSD yours or the other party's? If it is the other party's, why do you want to impose additional validation? – David W Feb 04 '11 at 03:27
  • This XSD may be changed by the other party without any notification, and it is complex enough that checking it is hard. The XMLs (their number will grow each day) would be checked from time to time to see if there are untranslated values. My idea is based on the belief that during validation the XML engine checks all XML fields and associates them with the types defined in the XSD anyway. So it should be possible to extract this information from the engine. The same goes for my related question about filling in the default values. – PoltoS Feb 04 '11 at 10:29
  • Is your idea to parse the XML and at the same time run XPath on the XSD to find the type of each node in the XML? That is a brute-force solution and looks too heavy. Can't I get this from the validation engine? Otherwise I'll in fact be writing half of my own validation engine. – PoltoS Feb 04 '11 at 10:33
  • Yes, my first idea was to run an XPath query on the XSD while parsing the XML. I agree that it is complex with a schema that could change. The more I think about it, the more this sounds like a case of applying additional validation. This could be done with a schema of your own, or in Python as you import. Are the non-empty English elements one of the problems you are really trying to solve? What Python parser are you using for the XML? Are you constructing Python classes to match the XSD (a Description class, etc.)? – David W Feb 04 '11 at 14:42
  • I'm using three different parsers: xml.dom.minidom (standard Python module), XMLObject (to map XML to objects, just to simplify life) and lxml.etree to validate my XML against the XSD. Unfortunately I don't know of a really good Python library that can read XML, validate it and be comfortable to use at the same time. – PoltoS Feb 11 '11 at 01:02
  • I posted a question to see if there was a library that both validates and populates the default values. If you want to stay with Python, it looks like you have to choose between creating the defaults and validating with Python code and/or using additional XSLT/XSD. http://stackoverflow.com/questions/4900867/is-there-a-xml-schema-validation-library-that-supports-the-default-attribute-valu/4901794#4901794 – David W Feb 13 '11 at 18:41