Extracting a nested namespace from an XML document using lxml

Question

I'm new to Python and currently learning to parse XML. Everything was going well until I hit a wall with nested namespaces.

Below is a snippet of my XML, showing the root element and the child element that I'm trying to parse:

<?xml version="1.0" encoding="UTF-8"?>
-<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
<!-- Generated by orca_wrapping version 3.8.3-0 -->
<Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
-------------
-------------
------------- 
-<cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO- ASDCP-CC-CPL-20070926#"><Id>urn:uuid:0607e57f-edcc-46ec- 997a-d2fbc0c1ea3a</Id><EditRate>24 1</EditRate><IntrinsicDuration>2698</IntrinsicDuration></cc-cpl:MainClosedCaption>
------------
------------
------------
</CompositionPlaylist>

What I need is a way to extract the namespace URI of the element whose local name is 'MainClosedCaption'. In this case, I'm trying to extract the string "http://www.digicine.com/PROTO- ASDCP-CC-CPL-20070926#". I have looked through a lot of tutorials but cannot seem to find a solution.

If anyone out there can lend their expertise, it would be much appreciated.

Here is what I have done so far, with help from the two contributors:

#!/usr/bin/env python

from xml.etree import ElementTree as ET #import ElementTree module as an alias ET
from lxml import objectify, etree

def parse():

    import os
    import sys
    cpl_file = sys.argv[1]
    xml_file = os.path.abspath(__file__)
    xml_file = os.path.dirname(xml_file)
    xml_file = os.path.join(xml_file, cpl_file)

    with open(xml_file) as f:
        xml = f.read()

    tree = etree.XML(xml)

    caption_namespace = etree.QName(tree.find('.//{*}MainClosedCaption')).namespace

    print caption_namespace
    print tree.nsmap

    nsmap = {}

    for ns in tree.xpath('//namespace::*'):
        if ns[0]:
            nsmap[ns[0]] = ns[1]
    tree.xpath('//cc-cpl:MainClosedCaption', namespace=nsmap)

    return nsmap


if __name__=="__main__":

    parse()

But it's not working so far. I get the result 'None' when I use QName to locate the tag and its namespace. And when I try to collect all the namespaces in the XML using a for loop, as suggested in another post, I get the error 'Unknown return type: dict'.

Any suggestions, please?

I'm not following your description. In this example, exactly what string are you trying to extract? – David May 08 '15 at 00:01
I'm trying to extract the namespace associated with the tag 'MainClosedCaption'. – Daniel Tan May 08 '15 at 00:21
In this case, the string that I'm trying to extract from the XML is 'http://www.digicine.com/PROTO- ASDCP-CC-CPL-20070926#' – Daniel Tan May 08 '15 at 00:22
I found this [solution](http://stackoverflow.com/questions/4210730/how-do-i-use-xml-namespaces-with-find-findall-in-lxml) that might be helpful. – David May 08 '15 at 01:20
@DanielTan Please post some code showing what you have tried so far. It is always easier for people to suggest a solution based on what you have, instead of starting over from scratch. And usually, that kind of solution is easier for the asker to understand too. – har07 May 08 '15 at 01:23

score 2 · Answer 1 · answered May 08 '15 at 02:35

This program prints the namespace of the indicated tag:

from lxml import etree

xml = etree.XML('''<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
<!-- Generated by orca_wrapping version 3.8.3-0 -->
<Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
<cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
<Id>urn:uuid:0607e57f-edcc-46ec- 997a-d2fbc0c1ea3a</Id>
<EditRate>24 1</EditRate>
<IntrinsicDuration>2698</IntrinsicDuration>
</cc-cpl:MainClosedCaption>
</CompositionPlaylist>
''')

print etree.QName(xml.find('.//{*}MainClosedCaption')).namespace

Result:

http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#
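
This works because each element's `tag` is stored in the form `{namespace-uri}localname`, and `QName` simply splits that string. As a rough sketch of the same idea without lxml, the following uses the standard library's `xml.etree.ElementTree` instead (a different library from the one used above) and is written for Python 3. The helper names `split_tag` and `namespace_of` are made up for this illustration, and the XML is a trimmed-down copy of the document above.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the XML above (illustrative only).
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
  <Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
  <cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
    <EditRate>24 1</EditRate>
    <IntrinsicDuration>2698</IntrinsicDuration>
  </cc-cpl:MainClosedCaption>
</CompositionPlaylist>
"""


def split_tag(tag):
    """Split a '{uri}local' tag into (uri, local); uri is None if unqualified."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag


def namespace_of(root, local_name):
    """Return the namespace URI of the first element with this local name."""
    for elem in root.iter():
        uri, local = split_tag(elem.tag)
        if local == local_name:
            return uri
    raise LookupError("no element named %r" % local_name)


root = ET.fromstring(XML)
caption_ns = namespace_of(root, "MainClosedCaption")
print(caption_ns)
```

Raising `LookupError` when nothing matches keeps "element not found" distinct from "element found but not in any namespace", which would otherwise both come back as `None`.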

Reference: http://lxml.de/tutorial.html#namespaces
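
Regarding the 'Unknown return type: dict' error mentioned in the question: as far as I can tell, the likely cause is the keyword `namespace=nsmap`. The keyword that `xpath()` expects for a prefix map is `namespaces` (plural); any other keyword argument is treated as an XPath variable, and a dict is not a valid variable value. Below is a minimal sketch of the corrected call, written for Python 3 and run against a trimmed-down copy of the XML above, not the asker's full file.

```python
from lxml import etree

# Trimmed-down copy of the XML above (illustrative only).
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
  <Id>urn:uuid:e0e43007-ca9b-4ed8-97b9-3ac9b272be7a</Id>
  <cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
    <EditRate>24 1</EditRate>
    <IntrinsicDuration>2698</IntrinsicDuration>
  </cc-cpl:MainClosedCaption>
</CompositionPlaylist>
"""

root = etree.XML(XML)

# Collect every prefixed namespace declared in the document.
# lxml returns namespace nodes as (prefix, uri) tuples. The default
# namespace has prefix None, which XPath cannot use, so it is skipped.
nsmap = {}
for prefix, uri in root.xpath('//namespace::*'):
    if prefix:
        nsmap[prefix] = uri

# The call from the question: 'namespace' is taken as an XPath variable,
# so the dict value is expected to be rejected here.
try:
    root.xpath('//cc-cpl:MainClosedCaption', namespace=nsmap)
except etree.XPathError as exc:
    print('xpath() failed:', exc)

# Corrected call: pass the prefix map through 'namespaces'.
matches = root.xpath('//cc-cpl:MainClosedCaption', namespaces=nsmap)
print(etree.QName(matches[0]).namespace)
```

Note that child elements such as `EditRate` carry no prefix, so they belong to the document's default namespace and not to the `cc-cpl` one.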

Robᵩ


I did what you suggested but got 'None' as a result. Please see my original post for my code. – Daniel Tan May 08 '15 at 18:29
When I run the code in your question against the XML in your question, I get `http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#`. (Of course, I have to fix the typos in your XML first.) Perhaps the XML snippet in your question doesn't represent the XML you are actually using? – Robᵩ May 08 '15 at 19:15
The complete XML is different, with more child elements under the root tag. But I have also copied the exact code that you pasted here, and I get 'None' as well. – Daniel Tan May 08 '15 at 22:07
I'm sorry, but I have no idea why we would each get different output from the exact same program. – Robᵩ May 09 '15 at 02:13
By the way, Rob's suggestion worked for me. I'm currently having difficulty extracting the //MainClosedCaption/Id element. http://stackoverflow.com/questions/37038148/extract-value-from-element-when-second-namespace-is-used-in-lxml – Tandy Freeman May 04 '16 at 21:23
