I've got the following "test.xml" file:

<?xml version="1.0" encoding="UTF-8"?>
<test:myXML xmlns:test="http://com/my/namespace" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Parent>
  <Child1 xsi:type="sample-type">
    <GrandChild1>123</GrandChild1>
    <GrandChild2>BranchName</GrandChild2>
  </Child1>
  <Child2 xsi:type="sample-type2"></Child2>
</Parent>
</test:myXML>

I would like to retrieve the 'xsi:type' for any node (where it exists). For example, in the above xml, I'd like to iterate over each node and return "sample-type" and "sample-type2"

So far, I've got the below code:

from lxml import etree

XMLDoc = etree.parse("test.xml")

for Node in XMLDoc.xpath('//*'):
    if "xsi:type" in Node.attrib:
        pass  # Do whatever

However, this doesn't work, because it seems like the "xsi:type" in the result is literally being replaced by the URI from the xmlns:xsi namespace declaration. As an illustration, if I print each node's attributes using the below code:

from lxml import etree

XMLDoc = etree.parse("test.xml")

for Node in XMLDoc.xpath('//*'):
    print(Node.attrib)

The result is:

{}
{}
{'{http://www.w3.org/2001/XMLSchema-instance}type': 'sample-type'}
{}
{}
{'{http://www.w3.org/2001/XMLSchema-instance}type': 'sample-type2'}

As you can see, where the "xsi:type" attribute exists, the xsi prefix is literally replaced with the URI from the namespace declaration. How can I stop that from happening? I'd like to search for xsi:type rather than inputting the string literal from the namespace declaration.

Adam

1 Answer

The xsi is the namespace prefix, not the namespace itself. The only place where the prefix needs to be consistent is within the scope of the XML element that declares it (that element and its descendants).

The prefix does not even need to be consistent within the same XML document: the same namespace can be referred to by any number of different prefixes in the same document.

It especially does not have to be consistent between the XML document and your XML processing code, and you should (read: must) not write any code that assumes the prefix or relies on the prefix.

This is why if "xsi:type" in Node.attrib: makes no sense - it assumes that the prefix must be xsi. xsi might be commonly used for the http://www.w3.org/2001/XMLSchema-instance namespace, but that's merely a convention, not a guarantee.

The XML document could be written as

<test:myXML xmlns:test="http://com/my/namespace" xmlns:blah="http://www.w3.org/2001/XMLSchema-instance">
<Parent>
  <Child1 blah:type="sample-type">
    <GrandChild1>123</GrandChild1>
    <GrandChild2>BranchName</GrandChild2>
  </Child1>
  <Child2 blah:type="sample-type2"></Child2>
</Parent>
</test:myXML>

and it would be exactly the same thing.
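
One way to check this claim is to parse both variants and compare the attribute names lxml reports. The following is only a minimal sketch: the helper name attribute_names and the inline TEMPLATE string are illustrative assumptions, not part of the original question.

```python
from lxml import etree

XSI_URI = "http://www.w3.org/2001/XMLSchema-instance"

# The same document twice; only the prefix bound to XSI_URI differs.
TEMPLATE = """<test:myXML xmlns:test="http://com/my/namespace" xmlns:{p}="{uri}">
<Parent>
  <Child1 {p}:type="sample-type"/>
  <Child2 {p}:type="sample-type2"/>
</Parent>
</test:myXML>"""


def attribute_names(xml_text):
    """Return the attribute names lxml reports, in document order."""
    root = etree.fromstring(xml_text)
    return [name for node in root.iter() for name in node.attrib]


names_xsi = attribute_names(TEMPLATE.format(p="xsi", uri=XSI_URI))
names_blah = attribute_names(TEMPLATE.format(p="blah", uri=XSI_URI))

print(names_xsi == names_blah)  # True
print(names_xsi[0])  # {http://www.w3.org/2001/XMLSchema-instance}type
```

Neither list contains "xsi:type" or "blah:type"; the prefix is gone after parsing and only the URI remains.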

That's why lxml uses the namespace URI, not the prefix, when it displays nodes, or in its XPath dialect - the URI is the important thing, the prefix is ephemeral.
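
The {URI}name form that lxml prints is commonly called Clark notation. As a small sketch (the variable names are illustrative), lxml's etree.QName can split such a name into its namespace URI and local part, or build it from the two:

```python
from lxml import etree

# The attribute name exactly as lxml printed it in the question.
clark_name = "{http://www.w3.org/2001/XMLSchema-instance}type"

qname = etree.QName(clark_name)
print(qname.namespace)  # http://www.w3.org/2001/XMLSchema-instance
print(qname.localname)  # type

# Building the same name from its two parts.
rebuilt = etree.QName("http://www.w3.org/2001/XMLSchema-instance", "type")
print(rebuilt.text == clark_name)  # True
```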

You need to define a namespace map in your program

nsmap = {
  'xsi': 'http://www.w3.org/2001/XMLSchema-instance'
}

and use that map when you select nodes in a namespace - either explicitly:

if f"{{{nsmap['xsi']}}}type" in node.attrib:
    # ...

or through XPath

type = node.xpath('@xsi:type', namespaces=nsmap)

This makes your program independent of the prefix - you are free to use any prefix you like, the XML document is free to use any prefix it likes, and the code will work either way.
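
Put together, the two approaches might look like the sketch below. It is an illustration under assumptions rather than a definitive implementation: the question's test.xml is embedded as the XML_TEXT string so the snippet runs without a file on disk, and the names XSI_TYPE, explicit, and via_xpath are made up for the example.

```python
from lxml import etree

nsmap = {"xsi": "http://www.w3.org/2001/XMLSchema-instance"}

# Fully qualified attribute name: '{http://www.w3.org/2001/XMLSchema-instance}type'
XSI_TYPE = f"{{{nsmap['xsi']}}}type"

# Inline copy of the question's test.xml (assumption, to keep this self-contained).
XML_TEXT = b"""<?xml version="1.0" encoding="UTF-8"?>
<test:myXML xmlns:test="http://com/my/namespace" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Parent>
  <Child1 xsi:type="sample-type">
    <GrandChild1>123</GrandChild1>
    <GrandChild2>BranchName</GrandChild2>
  </Child1>
  <Child2 xsi:type="sample-type2"></Child2>
</Parent>
</test:myXML>"""

root = etree.fromstring(XML_TEXT)

# Explicit lookup: test each element for the fully qualified attribute name.
explicit = [node.get(XSI_TYPE) for node in root.iter() if XSI_TYPE in node.attrib]

# XPath lookup: the 'xsi' prefix here is resolved through nsmap, not the document.
via_xpath = [str(value) for value in root.xpath("//@xsi:type", namespaces=nsmap)]

print(explicit)   # ['sample-type', 'sample-type2']
print(via_xpath)  # ['sample-type', 'sample-type2']
```

The explicit form is handy when you already have a node in hand; the XPath form collects every matching attribute in one call.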


An extreme example, but useful to outline the idea:

<test:myXML xmlns:test="http://com/my/namespace" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Parent xmlns:blah="http://www.w3.org/2001/XMLSchema-instance">
    <Child1 foo:type="sample-type" xmlns:foo="http://www.w3.org/2001/XMLSchema-instance">
      <GrandChild1>123</GrandChild1>
      <GrandChild2>BranchName</GrandChild2>
    </Child1>
    <Child2 blah:type="sample-type2"></Child2>
  </Parent>
</test:myXML>

Here, http://www.w3.org/2001/XMLSchema-instance gets three prefixes: xsi, blah, and foo, each one with a different scope.

When this is parsed, which one will you use to refer to xsi? Does it even matter? Should it matter? Nope, it should not. All that needs to match is the namespace URI, we don't care one bit what the XML document does with the prefixes:

nsmap = {
  's': 'http://www.w3.org/2001/XMLSchema-instance'
}

type = node.xpath('@s:type', namespaces=nsmap)
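
To see this run against the three-prefix document above, here is a minimal sketch. The document is embedded as a string and the found dictionary is an illustrative name; the code never mentions xsi, blah, or foo.

```python
from lxml import etree

# Our own prefix 's'; it appears nowhere in the document.
nsmap = {"s": "http://www.w3.org/2001/XMLSchema-instance"}

XML_TEXT = """<test:myXML xmlns:test="http://com/my/namespace" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Parent xmlns:blah="http://www.w3.org/2001/XMLSchema-instance">
    <Child1 foo:type="sample-type" xmlns:foo="http://www.w3.org/2001/XMLSchema-instance">
      <GrandChild1>123</GrandChild1>
      <GrandChild2>BranchName</GrandChild2>
    </Child1>
    <Child2 blah:type="sample-type2"></Child2>
  </Parent>
</test:myXML>"""

root = etree.fromstring(XML_TEXT)

# Map each element's tag to its type attribute, where one exists.
found = {}
for node in root.iter():
    values = node.xpath("@s:type", namespaces=nsmap)
    if values:
        found[node.tag] = str(values[0])

print(found)  # {'Child1': 'sample-type', 'Child2': 'sample-type2'}
```
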
Tomalak
  • I have been looking for an answer to this question also, I came up with something similar. I found these resources, might be good complementary answers: https://stackoverflow.com/a/46422793/2902996 https://stackoverflow.com/questions/59850806/registering-namespaces-with-lxml-before-parsing – Joel Mar 02 '20 at 15:27
  • Sorry if I misunderstood something, but do I have to explicitly specify 'xsi': 'http://www.w3.org/2001/XMLSchema-instance' in the nsmap? If so, how does this make the program independent of the prefix? Is there a way to not have to define it explicitly? – Adam Mar 02 '20 at 15:27
  • Check this @Adam https://en.wikipedia.org/wiki/QName . http://effbot.org/zone/element-namespaces.htm – Joel Mar 02 '20 at 15:29
  • @Adam You don't have to say `'xsi': 'http://www.w3.org/2001/XMLSchema-instance'` in the nsmap. You could say `'foobar': 'http://www.w3.org/2001/XMLSchema-instance'` and nothing would change, that is the whole point I was trying to make. Prefixes are a convenience feature, they need to be consistent *in their respective scopes*, but not *across scopes*. Your Python program is one scope. If you use `foobar` every time you want to refer to the `'http://www.w3.org/2001/XMLSchema-instance'` namespace, that's fine – no matter how that namespace is abbreviated in the XML. – Tomalak Mar 02 '20 at 15:33
  • @Adam See the extended example at the bottom of my post. – Tomalak Mar 02 '20 at 15:43
  • And to answer your other question, no there is no way to not have to define it explicitly. At some point you have to spell out the namespace URIs you are going to use, there is no way around it. XML does it in `xmlns` declarations, your Python code must do the same thing. The namespace URI is everything. You cannot go by the prefix alone, because the prefix *means nothing*. – Tomalak Mar 02 '20 at 15:50
  • Thank you @Tomalak. I will have a look at all of this, modify my code, and get back to you shortly. I think I should be able to resolve everything given the information you've provided me, but I'll let you know if I'm having difficulties. I should be able to accept your answer as the correct one very soon. – Adam Mar 02 '20 at 16:11
  • Just a quick one, in your example of 'if f"{{{nsmap['xsi']}}}type" in node.attrib: ... how do I get the actual type? If I try it with XPATH, I get an error 'TypeError: xpath() takes exactly 1 positional argument (2 given)' – Adam Mar 02 '20 at 16:15
  • @Adam The true attribute name is `{http://www.w3.org/2001/XMLSchema-instance}type`, just as your `print()` statement shows. `f"{{{nsmap['xsi']}}}type"` is just a shorter way to create it, using Python's format strings. So `node.attrib[f"{{{nsmap['xsi']}}}type"]` would work. Maybe you want to store the name in a variable and use that. – Tomalak Mar 02 '20 at 16:24
  • @Adam The other error is because `lxml` requires a keyword argument for the namespace map - a positional argument does not work. My fault, see updated code sample. – Tomalak Mar 02 '20 at 16:29
  • @Tomalak thank you again. Sorry if you've mentioned this and I misunderstood, but is there a way that I can get the attribute of the namespace including xsi? For example, a command that can return the xmlns:xsi (i.e. "http://www.w3.org/2001/XMLSchema-instance"), that way I don't have to explicitly write "http://www.w3.org/2001/XMLSchema-instance" in the nsmap but rather can include the variable? Does that make sense? – Adam Mar 02 '20 at 16:32