Retain namespace prefix in a tag when parsing xml using lxml

Question

I have an XML file as shown below. A few tags are prefixed with ce, for example <ce:title>. When I run the code below with XPath, <ce:title> is replaced with <title> in the output. I have seen other questions on SO, such as "How to preserve namespace information when parsing HTML with lxml?", but I am not sure where and how to add the namespace details.

Can someone please suggest how I can retain <ce:title> for the XML below?

from lxml import html
from lxml.etree import tostring
with open('102277033304.xml', encoding='utf-8') as file_object:
    xml = file_object.read().strip()
    root = html.fromstring(xml)
    for element in root.xpath('//item/book/pages/*'):
        # serialize each child of <pages> to bytes
        serialized = tostring(element, encoding='utf-8')
        print(serialized)

XML:

<item>
    <book>
        <pages>
            <page-info>
                <page>
                  <ce:title>Chapter 1</ce:title>
                  <content>Welcome to Chapter 1</content>
                </page>
                <page>
                 <ce:title>Chapter 2</ce:title>
                 <content>Welcome to Chapter 2</content>
                </page>
            </page-info>
            <page-fulltext>Published in page 1</page-fulltext>
            <page-info>
                <page>
                  <ce:title>Chapter 1</ce:title>
                  <content>Welcome to Chapter 1</content>
                </page>
                <page>
                 <ce:title>Chapter 2</ce:title>
                 <content>Welcome to Chapter 2</content>
                </page>
            </page-info>
            <page-fulltext>Published in page 2</page-fulltext>
            <page-info>
                <page>
                  <ce:title>Chapter 1</ce:title>
                  <content>Welcome to Chapter 1</content>
                </page>
                <page>
                 <ce:title>Chapter 2</ce:title>
                 <content>Welcome to Chapter 2</content>
                </page>
            </page-info>
            <page-fulltext>Published in page 3</page-fulltext>
        </pages>
    </book>
</item>

The "XML" in the question is not really XML since there is no namespace declaration for the `ce` prefix (such as `xmlns:ce="http://example.com"`). — mzjn, Aug 05 '20 at 15:39
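
To see what this comment means in practice, here is a minimal sketch. The `snippet` string is a made-up, shortened stand-in for the question's document, not the real file. With lxml's XML parser, an undeclared prefix is reported as a syntax error rather than being silently dropped:

```python
from lxml import etree

# Hypothetical, shortened version of the question's XML: `ce` is never declared.
snippet = "<page><ce:title>Chapter 1</ce:title></page>"

try:
    etree.fromstring(snippet)
    parsed = True
except etree.XMLSyntaxError as exc:
    # The XML parser rejects the document because the prefix is unbound.
    parsed = False
    print("not well-formed:", exc)
```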

score 1 · Accepted Answer · answered Aug 05 '20 at 15:08

That's probably because you are using an HTML parser to read XML.

Try it like this:

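from lxml import etree

# Use the XML parser rather than lxml.html.
# `xml` is the document text read from the file, as in the question.
root = etree.XML(xml)
for element in root.xpath('//item/book/pages/*'):
    # tostring() returns bytes when an encoding such as 'utf-8' is given
    serialized = etree.tostring(element, encoding='utf-8')
    print(serialized)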

This should give you the expected output.

Jack Fleeting

This won't work. The `ce` prefix is not declared so the document in the question is not well-formed XML. – mzjn Aug 05 '20 at 15:29

@mzjn True, I just assumed that there is a declaration in OP's actual xml. Maybe I should have made it explicit.... – Jack Fleeting Aug 05 '20 at 15:32
Yeah, I added the tag `` to the XML and then, instead of using `html`, I used `etree.fromstring()`, which fixed the issue. – Shankar Guru Aug 05 '20 at 15:52
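
For readers landing here, a minimal sketch of the fix described in the comments, under two assumptions: the `ce` prefix is declared on the root element, and the namespace URI `http://example.com` is only a placeholder borrowed from the example in mzjn's comment (use whatever URI your real document declares). The XML is a shortened version of the one in the question.

```python
from lxml import etree

# Assumption: the namespace URI below is a placeholder, not the real one.
xml = b"""<item xmlns:ce="http://example.com">
    <book>
        <pages>
            <page-info>
                <page>
                    <ce:title>Chapter 1</ce:title>
                    <content>Welcome to Chapter 1</content>
                </page>
            </page-info>
            <page-fulltext>Published in page 1</page-fulltext>
        </pages>
    </book>
</item>"""

# Parse with the XML parser, not lxml.html.
root = etree.fromstring(xml)

outputs = []
for element in root.xpath('//item/book/pages/*'):
    # encoding='unicode' returns str instead of bytes
    serialized = etree.tostring(element, encoding='unicode')
    outputs.append(serialized)
    print(serialized)

# Selecting a prefixed element in XPath needs an explicit prefix mapping.
titles = root.xpath('//ce:title/text()', namespaces={'ce': 'http://example.com'})
print(titles)
```

When a child element is serialized on its own like this, lxml generally repeats the `xmlns:ce` declaration on the fragment's top-level tag so that the fragment stays well-formed by itself.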
