I see there are similar questions here, but none has fully helped me. I've also read the official documentation on namespaces and can't find what I need; perhaps I'm just too new to XML. I understand that I may need to create my own namespace dictionary? Either way, here is my situation:

I get a result from an API call: an XML document that is stored as a string in my Python application.

What I'm trying to accomplish is to take this XML, swap out one small value (the b:string value under ConditionValue/Default, but that's irrelevant to this question), and then save it as a string to send later in a REST POST call.

The source XML looks like this:

<Context xmlns="http://Test.the.Sdk/2010/07" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<xmlns i:nil="true" xmlns="http://schema.test.org/2004/07/Test.Soa.Vocab" xmlns:a="http://schema.test.org/2004/07/System.Xml.Serialize"/>
<Conditions xmlns:a="http://schema.test.org/2004/07/Test.Soa.Vocab">
    <a:Condition>
        <a:xmlns i:nil="true" xmlns:b="http://schema.test.org/2004/07/System.Xml.Serialize"/>
        <Identifier>a23aacaf-9b6b-424f-92bb-5ab71505e3bc</Identifier>
        <Name>Code</Name>
        <ParameterSelections/>
        <ParameterSetCollections/>
        <Parameters/>
        <Summary i:nil="true"/>
        <Instance>25486d6c-36ba-4ab2-9fa6-0dbafbcf0389</Instance>
        <ConditionValue>
            <ComplexValue i:nil="true"/>
            <Text i:nil="true" xmlns:b="http://schemas.microsoft.com/2003/10/Serialization/Arrays"/>
            <Default>
                <ComplexValue i:nil="true"/>
                <Text xmlns:b="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
                    <b:string>NULLCODE</b:string>
                </Text>
            </Default>
        </ConditionValue>
        <TypeCode>String</TypeCode>
    </a:Condition>
    <a:Condition>
        <a:xmlns i:nil="true" xmlns:b="http://schema.test.org/2004/07/System.Xml.Serialize"/>
        <Identifier>0af860f6-5611-4a23-96dc-eb3863975529</Identifier>
        <Name>Content Type</Name>
        <ParameterSelections/>
        <ParameterSetCollections/>
        <Parameters/>
        <Summary i:nil="true"/>
        <Instance>6364ec20-306a-4cab-aabc-8ec65c0903c9</Instance>
        <ConditionValue>
            <ComplexValue i:nil="true"/>
            <Text i:nil="true" xmlns:b="http://schemas.microsoft.com/2003/10/Serialization/Arrays"/>
            <Default>
                <ComplexValue i:nil="true"/>
                <Text xmlns:b="http://schemas.microsoft.com/2003/10/Serialization/Arrays">
                    <b:string>Standard</b:string>
                </Text>
            </Default>
        </ConditionValue>
        <TypeCode>String</TypeCode>
    </a:Condition>
</Conditions>

My job is to swap out one of the values, retaining the entire structure of the source, and use this to submit a POST later on in the application.

The problem that I am having is that when it saves to a string or to a file, it totally messes up the namespaces:

<ns0:Context xmlns:ns0="http://Test.the.Sdk/2010/07" xmlns:ns1="http://schema.test.org/2004/07/Test.Soa.Vocab" xmlns:ns3="http://schemas.microsoft.com/2003/10/Serialization/Arrays" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns1:xmlns xsi:nil="true" />
<ns0:Conditions>
<ns1:Condition>
<ns1:xmlns xsi:nil="true" />
<ns0:Identifier>a23aacaf-9b6b-424f-92bb-5ab71505e3bc</ns0:Identifier>
<ns0:Name>Code</ns0:Name>
<ns0:ParameterSelections />
<ns0:ParameterSetCollections />
<ns0:Parameters />
<ns0:Summary xsi:nil="true" />
<ns0:Instance>25486d6c-36ba-4ab2-9fa6-0dbafbcf0389</ns0:Instance>
<ns0:ConditionValue>
<ns0:ComplexValue xsi:nil="true" />
<ns0:Text xsi:nil="true" />
<ns0:Default>
<ns0:ComplexValue xsi:nil="true" />
<ns0:Text>
<ns3:string>NULLCODE</ns3:string>
</ns0:Text>
</ns0:Default>
</ns0:ConditionValue>
<ns0:TypeCode>String</ns0:TypeCode>
</ns1:Condition>
<ns1:Condition>
<ns1:xmlns xsi:nil="true" />
<ns0:Identifier>0af860f6-5611-4a23-96dc-eb3863975529</ns0:Identifier>
<ns0:Name>Content Type</ns0:Name>
<ns0:ParameterSelections />
<ns0:ParameterSetCollections />
<ns0:Parameters />
<ns0:Summary xsi:nil="true" />
<ns0:Instance>6364ec20-306a-4cab-aabc-8ec65c0903c9</ns0:Instance>
<ns0:ConditionValue>
<ns0:ComplexValue xsi:nil="true" />
<ns0:Text xsi:nil="true" />
<ns0:Default>
<ns0:ComplexValue xsi:nil="true" />
<ns0:Text>
<ns3:string>Standard</ns3:string>
</ns0:Text>
</ns0:Default>
</ns0:ConditionValue>
<ns0:TypeCode>String</ns0:TypeCode>
</ns1:Condition>
</ns0:Conditions>

I've narrowed the code down to its most basic form and I still get the same result, so it has nothing to do with how I normally manipulate the file:

import xml.etree.ElementTree as ET
import requests

get_context_xml = 'http://localhost/testapi/returnxml'  # returns the first XML example above
source_context_xml = requests.get(get_context_xml)

# fromstring() needs the response body, not the Response object.
Tree = ET.fromstring(source_context_xml.content)

# Ensure the original namespaces are intact.
for Conditions in Tree.iter('{http://schema.test.org/2004/07/Test.Soa.Vocab}Condition'):
    print("success")

# tostring() returns bytes on Python 3, so write in binary mode.
with open('/home/memyself/output.xml', 'wb') as f:
    f.write(ET.tostring(Tree))
Anand S Kumar
emmdee
  • You tagged the question with "lxml". Did you try it? I think most if not all of the problems will go away if you do. lxml is similar to ElementTree, but leaves your namespaces alone. – mzjn Aug 10 '15 at 06:40

2 Answers

You need to register the prefix and the namespace before you call fromstring() (reading the XML) to avoid the auto-generated namespace prefixes (ns0, ns1, and so on). The registry is global, and ElementTree consults it when it serializes the tree.

You can use the ET.register_namespace() function for that. Example:

ET.register_namespace('<prefix>','http://Test.the.Sdk/2010/07')
ET.register_namespace('a','http://schema.test.org/2004/07/Test.Soa.Vocab')

You can leave the <prefix> empty if you do not want a prefix.


Example/demo:

>>> r = ET.fromstring('<a xmlns="blah">a</a>')
>>> ET.tostring(r)
b'<ns0:a xmlns:ns0="blah">a</ns0:a>'
>>> ET.register_namespace('','blah')
>>> r = ET.fromstring('<a xmlns="blah">a</a>')
>>> ET.tostring(r)
b'<a xmlns="blah">a</a>'
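
Applied to the XML in the question, the same idea might look like the sketch below. The namespace URIs are copied from the question; the SOURCE string is a cut-down, hypothetical stand-in for the API response, and replace_string_values is a name invented for this illustration. Each URI gets exactly one prefix, and each prefix is used once.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the question; the prefixes are a free choice.
ROOT_NS = "http://Test.the.Sdk/2010/07"
VOCAB_NS = "http://schema.test.org/2004/07/Test.Soa.Vocab"
ARRAYS_NS = "http://schemas.microsoft.com/2003/10/Serialization/Arrays"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# One prefix per URI; '' makes ROOT_NS the default namespace on output.
ET.register_namespace("", ROOT_NS)
ET.register_namespace("a", VOCAB_NS)
ET.register_namespace("b", ARRAYS_NS)
ET.register_namespace("i", XSI_NS)

# Cut-down stand-in for the API response (hypothetical sample data).
SOURCE = (
    '<Context xmlns="http://Test.the.Sdk/2010/07" '
    'xmlns:i="http://www.w3.org/2001/XMLSchema-instance">'
    '<Conditions xmlns:a="http://schema.test.org/2004/07/Test.Soa.Vocab">'
    '<a:Condition><Name>Code</Name><Summary i:nil="true"/>'
    '<Text xmlns:b="http://schemas.microsoft.com/2003/10/Serialization/Arrays">'
    '<b:string>NULLCODE</b:string></Text>'
    '</a:Condition></Conditions></Context>'
)


def replace_string_values(xml_text, new_value):
    """Set the text of every b:string element and return the XML as str."""
    root = ET.fromstring(xml_text)
    # Elements are addressed by '{uri}localname', independent of any prefix.
    for node in root.iter("{%s}string" % ARRAYS_NS):
        node.text = new_value
    return ET.tostring(root, encoding="unicode")


result = replace_string_values(SOURCE, "NEWCODE")
print(result)
```

The output will not be byte-identical to the input: ElementTree writes all namespace declarations on the root element instead of where they appeared in the source. The element names, prefixes, and namespace bindings should still match what the receiving service expects.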
Daniel Haley
Anand S Kumar
  • Thanks I'm confused on what values to set for the prefixes. Looking at all the declarations throughout the original XML, how can I correlate which prefix to assign to which namespace? `xmlns="http://Test.the.Sdk/2010/07" xmlns="http://schema.test.org/2004/07/Test.Soa.Vocab" xmlns:a="http://schema.test.org/2004/07/System.Xml.Serialize" xmlns:a="http://schema.test.org/2004/07/Test.Soa.Vocab" xmlns:b="http://schemas.microsoft.com/2003/10/Serialization/Arrays" xmlns:b="http://schema.test.org/2004/07/System.Xml.Serialize" xmlns:i="http://www.w3.org/2001/XMLSchema-instance"` – emmdee Aug 04 '15 at 10:45
  • Assign the prefix after the `:` to the namespace; if the `xmlns` declaration has no such part, set the prefix to empty. Example: `b` for `http://schemas.microsoft.com/2003/10/Serialization/Arrays` and `b` for `http://schema.test.org/2004/07/System.Xml.Serialize`. But you can also specify your own prefixes, which are more readable (the source XML seems to use the same prefix for multiple namespaces, which, though valid, may not be good for readability). – Anand S Kumar Aug 04 '15 at 10:48
  • Unfortunately I can't get it to save in the exact same format it was opened in. Now it added a larger declaration of prefixes and kept the ns0. Is there no way to make ElementTree just keep the formatting the way it was opened? – emmdee Aug 04 '15 at 11:10
  • Is it still `ns0` and `ns1`? And you did add the namespaces before reading the XML, right? As suggested: *before you do fromstring() (Reading the xml)* – Anand S Kumar Aug 04 '15 at 11:11
  • Correct, the first lines in my script are: `ET.register_namespace('', 'http://Telestream.Vantage.Sdk/2010/07') ET.register_namespace('i', 'http://www.w3.org/2001/XMLSchema-instance') ET.register_namespace('', 'http://schemas.datacontract.org/2004/07/Telestream.Soa.Vocabulary') ET.register_namespace('b', 'http://schemas.datacontract.org/2004/07/System.Xml.Serialization') ET.register_namespace('a', 'http://schemas.datacontract.org/2004/07/System.Xml.Serialization') ET.register_namespace('a', 'http://schemas.datacontract.org/2004/07/Telestream.Soa.Vocabulary') ` – emmdee Aug 04 '15 at 11:13
  • I wish I could paste full XML for analysis. The character limit is hurting. It's adding the namespaces now but kept the ns0 `` – emmdee Aug 04 '15 at 11:20
  • I tested, for some reason for your xml, empty prefix is not working, try putting some meaningful names in their places. – Anand S Kumar Aug 04 '15 at 11:38
  • Thank you, that seemed to clear it up. There are a few spots that still don't look right but I'll continue to experiment and see if I can get them working. This whole prefix thing and just putting a random name in there fixing it just confuses me even more though haha. – emmdee Aug 04 '15 at 11:47
  • I will be sure to do that once I get a final working method. I just realized that you posted a demo/sample above so I'll explore that too. I'm still working on it as there is more to this component that needs to be in place before I can test the live XML's. I'll be sure to mark the proper answer once all is resolved. Thanks so much for your help so far – emmdee Aug 08 '15 at 07:52
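
A possible explanation for the ns0 that survived in the exchange above, offered as a sketch and not a diagnosis: per the ElementTree documentation, register_namespace() removes any existing mapping for either the given prefix or the given URI. The registry therefore holds one prefix per URI and one URI per prefix, so registering '' or 'a' a second time discards the earlier registration. The URIs and the prefix p below are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical URIs, used only to show the registry behaviour.
FIRST_NS = "urn:example:first"
SECOND_NS = "urn:example:second"

ET.register_namespace("p", FIRST_NS)
# Reusing the prefix drops the FIRST_NS mapping registered just above.
ET.register_namespace("p", SECOND_NS)

root = ET.fromstring(
    '<x xmlns="urn:example:first"><y xmlns="urn:example:second"/></x>'
)
# FIRST_NS no longer has a registered prefix, so a generated one is used.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

Giving every namespace URI its own distinct prefix avoids this.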

First off, welcome to the StackOverflow network! Technically @anand-s-kumar is correct. However, there was a minor misuse of the tostring function, and the namespaces might not always be known to the code, or might differ between tags or XML files. Also, inconsistencies between the lxml and xml.etree libraries and between Python 2.x and 3.x make handling this difficult.

This function iterates through all of the elements in the XML tree that is passed in and edits each tag to remove its namespace. Note that by doing this, some data may be lost.

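import re

def remove_namespaces(tree):
    # Strip the leading "{namespace-uri}" from every element tag, in place.
    # iter() replaces getiterator(), which was removed in Python 3.9.
    for el in tree.iter():
        match = re.match(r"^(?:\{.*?\})?(.*)$", el.tag)
        if match:
            el.tag = match.group(1)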

I myself just ran into this problem, and hacked together a quick solution. I tested this on about 81,000 XML files (averaging around 150 MB each) that had this problem, and all of them were fixed. Note that this isn't exactly an optimal solution, but it is relatively efficient and worked quite well for me.
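
To make the data-loss caveat concrete, here is a small self-contained sketch. The helper strip_namespaces and the urn:example URIs are hypothetical names chosen for this illustration; unlike the function above, the helper also strips namespaces from attribute names. Two elements that differ only by namespace become indistinguishable once the namespaces are gone.

```python
import re
import xml.etree.ElementTree as ET

# Matches the leading '{namespace-uri}' that ElementTree puts on names.
NS_PREFIX = re.compile(r"^\{[^}]*\}")


def strip_namespaces(root):
    """Remove the '{uri}' part of every tag and attribute name, in place."""
    for el in root.iter():
        if isinstance(el.tag, str):  # skip comments and processing instructions
            el.tag = NS_PREFIX.sub("", el.tag)
        for name in list(el.attrib):
            bare = NS_PREFIX.sub("", name)
            if bare != name:
                el.attrib[bare] = el.attrib.pop(name)
    return root


SAMPLE = (
    '<r xmlns:a="urn:example:a" xmlns:b="urn:example:b">'
    '<a:item>1</a:item><b:item>2</b:item></r>'
)

root = strip_namespaces(ET.fromstring(SAMPLE))
flattened = ET.tostring(root, encoding="unicode")
print(flattened)
```

After stripping, a:item and b:item are both plain item elements, so a consumer that relied on the namespace to tell them apart can no longer do so.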

CREDIT: Idea and code structure originally from Jochen Kupperschmidt.

andrewgu
  • Thanks and very interesting. I am going to submit a POST through a REST API and I am not sure if the receiving node will accept it without namespaces. That would be ideal if it ignores them. I'll see what I can whip up. Thanks. – emmdee Aug 08 '15 at 07:54