python xpath parsing of xml avoiding

Question

I am using xpath to parse an xml file

from lxml import etree

example='''<div n="0001" type="car" xml:id="_3a327f0002">
                <p xml:id="_3a327f0003">
                1. A car is
                    <p xml:id="_3a327f0004"> - big, yellow and red;</p>
                    <p xml:id="_3a327f0005"> - has a big motor;</p>
                    <p xml:id="_3a327f0006"> - and also has <lb/>
                      big seats.
                    </p>
                </p>
                </div>'''

I want to serialize the above XML file in the following way:

{"_3a327f0003": "1. A car is",
 "_3a327f0004": "- big, yellow and red;",
 "_3a327f0005": "- has a big motor;",
 "_3a327f0006": "- and also has big seats"}

Basically, I want to extract the text and build a dictionary in which each text is keyed by its xml:id. My code is as follows:

parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)

XML_tree = etree.fromstring(example.encode() , parser=parser)
all_paras = XML_tree.xpath('.//p[@xml:id]')

list_of_paragraphs = []
for para in all_paras:
    mydict = {}
    mydict['text'] = para.text
    for att in para.attrib:
        mykey=att
        if 'id' in mykey:
            mykey='xmlid'
        mydict[mykey] = para.attrib[att]
    list_of_paragraphs.append(mydict)

PDM_XML_serializer(example)

It works except for the fact that if I have a node like:

<p xml:id="_3a327f0006"> - and also has <lb/>
                        big seats.
                      </p>

It will not extract the part after the <lb/> element ("big seats.").

How should I modify:

XML_tree.xpath('.//p[@xml:id]')

in order to get all the text from <p> to </p>?

EDIT: para.itertext() could be used but then the first node will give back all the text of the other nodes as well.

AttributeError: 'lxml.etree._Element' object has no attribute 'text_content' — JFerro, May 25 '21 at 10:02
actually when doing print(dir(para)) I get the list of the methods not including text_content — JFerro, May 25 '21 at 10:07
Weird, because in [docs](https://lxml.de/lxmlhtml.html) this method exists. Ah, my bad, you should use html parser `lxml.html.fromstring()`. — Olvin Roght, May 25 '21 at 10:11
Indeed, the method is in the documentation, that's really weird — JFerro, May 25 '21 at 11:59

score 2 · Answer 1 · answered May 25 '21 at 15:19

Using xml.etree.ElementTree

import xml.etree.ElementTree as ET

xml = '''<div n="0001" type="car" xml:id="_3a327f0002">
                <p xml:id="_3a327f0003">
                1. A car is
                    <p xml:id="_3a327f0004"> - big, yellow and red;</p>
                    <p xml:id="_3a327f0005"> - has a big motor;</p>
                    <p xml:id="_3a327f0006"> - and also has <lb/>
                      big seats.
                    </p>
                </p>
                </div>'''


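def _get_element_txt(element):
    # element.text is None when the element starts directly with a child tag
    txt = element.text or ''
    children = list(element)
    if children:
        # text following the first child (e.g. <lb/>) is stored in its tail
        txt += (children[0].tail or '').strip()
    return txt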


root = ET.fromstring(xml)
data = {p.attrib['{http://www.w3.org/XML/1998/namespace}id']: _get_element_txt(p)
        for p in root.findall('.//p/p')}
for k, v in data.items():
    print(f'{k} --> {v}')

output

_3a327f0004 -->  - big, yellow and red;
_3a327f0005 -->  - has a big motor;
_3a327f0006 -->  - and also has big seats.
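
The dictionary above leaves out the outer paragraph (_3a327f0003) because of the './/p/p' path, and the helper only reads the tail of the first child. A minimal sketch of one way to cover every p, still with xml.etree.ElementTree: join the element's own text with the tail of each direct child, then collapse whitespace. The names XML_ID and own_text are made up for this illustration, and the sample XML is a shortened copy of the one in the question.

```python
import xml.etree.ElementTree as ET

# Qualified name of xml:id as ElementTree stores it
XML_ID = '{http://www.w3.org/XML/1998/namespace}id'

xml = '''<div n="0001" type="car" xml:id="_3a327f0002">
  <p xml:id="_3a327f0003">
    1. A car is
    <p xml:id="_3a327f0004"> - big, yellow and red;</p>
    <p xml:id="_3a327f0006"> - and also has <lb/>
      big seats.
    </p>
  </p>
</div>'''


def own_text(element):
    """Text that belongs to element itself, without text of child elements."""
    parts = [element.text or '']
    # the text after each child tag lives in that child's tail
    parts.extend(child.tail or '' for child in element)
    # collapse runs of whitespace and line breaks into single spaces
    return ' '.join(' '.join(parts).split())


root = ET.fromstring(xml)
data = {p.get(XML_ID): own_text(p)
        for p in root.iter('p') if p.get(XML_ID) is not None}
for k, v in data.items():
    print(f'{k} --> {v}')
```

Nested p elements are skipped on purpose (only their tail is read), so the outer paragraph keeps just "1. A car is".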

Parfait · Answer 2 · 2021-05-25T22:11:44.443

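Using lxml.etree, process all elements in all_paras with a dict comprehension. Your XML uses the special xml prefix, and an lxml attribute cannot be looked up by its prefixed name (xml:id); it is stored under the qualified name {http://www.w3.org/XML/1998/namespace}id (see @mzjn's answer here). The code below therefore uses a workaround with next + iter to retrieve the attribute value. This relies on xml:id being the first attribute of each p, which is the case in your example.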

Additionally, to retrieve all text values between nodes, xpath("text()") is used with str.strip and .join to clean up whitespace and line breaks and concatenate together.

from lxml import etree

example='''<div n="0001" type="car" xml:id="_3a327f0002">
                <p xml:id="_3a327f0003">
                1. A car is
                    <p xml:id="_3a327f0004"> - big, yellow and red;</p>
                    <p xml:id="_3a327f0005"> - has a big motor;</p>
                    <p xml:id="_3a327f0006"> - and also has <lb/>
                      big seats.
                    </p>
                </p>
                </div>'''
                
XML_tree = etree.fromstring(example)
all_paras = XML_tree.xpath('.//p[@xml:id]')

output = {
    next(iter(t.attrib.values())):" ".join(i.strip() 
        for i in t.xpath("text()")).strip()
    for t in all_paras
}

output
# {
#  '_3a327f0003': '1. A car is', 
#  '_3a327f0004': '- big, yellow and red;',
#  '_3a327f0005': '- has a big motor;',
#  '_3a327f0006': '- and also has big seats.'
# }
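
If a p can carry other attributes before xml:id, next + iter would pick up the wrong value. A hedged sketch of an alternative, assuming you are fine with spelling out the namespace: look the attribute up by its qualified name with .get(). The constant XML_ID and the short sample document are made up for this illustration.

```python
from lxml import etree

# Qualified (Clark notation) name under which lxml stores xml:id
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample: xml:id is NOT the first attribute here
sample = '<div xml:id="d1"><p n="7" xml:id="a">one <lb/> two</p></div>'

tree = etree.fromstring(sample)
output = {
    # .get() returns the value of xml:id regardless of attribute order
    p.get(XML_ID): " ".join(" ".join(p.xpath("text()")).split())
    for p in tree.xpath(".//p[@xml:id]")
}
print(output)
```

The whitespace cleanup uses split + join instead of strip on each piece, which also collapses line breaks inside a single text node.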

Hi @Parfait, could you have a relook, I made a mistake and I forgot that there was a line missing in my desired output, since I also want the key/value pair in the dict with key "_3a327f0003", thanks — JFerro, May 25 '21 at 21:49

score 0 · Answer 3 · answered May 25 '21 at 10:02

You could use lxml itertext() to get text content of the p element:

mydict['text'] = ''.join(para.itertext())

See this question as well for a more generic solution.

Alexandra Dudkina

This is only a partial solution: with nested p tags, the outermost tag will also include the text of the inner tags. – JFerro May 25 '21 at 12:00
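
To illustrate the point raised in the comment, here is a small sketch of an itertext()-like helper that descends into inline children but stops at nested p elements. It is shown with the standard library's xml.etree.ElementTree, which uses the same text/tail model as lxml; the helper names text_without_nested and clean, the hi element, and the sample XML are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with a nested p, an inline <hi> and an empty <lb/>
sample = '''<p xml:id="outer">
  1. A car is
  <p xml:id="inner"> - has <hi>big</hi> seats <lb/> and a motor;</p>
</p>'''


def text_without_nested(element, skip_tag="p"):
    """Like ''.join(element.itertext()), but ignores nested skip_tag elements."""
    parts = [element.text or ""]
    for child in element:
        if child.tag != skip_tag:
            # inline children (hi, lb, ...) contribute their own text
            parts.append(text_without_nested(child, skip_tag))
        # text after the child tag always belongs to the parent
        parts.append(child.tail or "")
    return "".join(parts)


def clean(text):
    """Collapse whitespace and line breaks into single spaces."""
    return " ".join(text.split())


outer = ET.fromstring(sample)
inner = outer.find("p")

full_outer = clean("".join(outer.itertext()))   # includes the inner p
own_outer = clean(text_without_nested(outer))   # outer text only
own_inner = clean(text_without_nested(inner))

print(full_outer)
print(own_outer)
print(own_inner)
```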

score 0 · Answer 4 · answered May 25 '21 at 17:35

This modifies the xpath to exclude the "A car is" text as per your example. It also uses the xpath functions string and normalize-space to evaluate the para node as a string and join its text nodes, as well as clean up the text to match your example.

from lxml import etree

example='''<div n="0001" type="car" xml:id="_3a327f0002">
                <p xml:id="_3a327f0003">
                1. A car is
                    <p xml:id="_3a327f0004"> - big, yellow and red;</p>
                    <p xml:id="_3a327f0005"> - has a big motor;</p>
                    <p xml:id="_3a327f0006"> - and also has <lb/>
                      big seats.
                    </p>
                </p>
                </div>'''

parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)

XML_tree = etree.fromstring(example.encode() , parser=parser)
all_paras = XML_tree.xpath('./p/p[@xml:id]')

list_of_paragraphs = []
for para in all_paras:
    mydict = {}
    mydict['text'] = para.xpath('normalize-space(string(.))')
    for att in para.attrib:
        mykey=att
        if 'id' in mykey:
            mykey='xmlid'
        mydict[mykey] = para.attrib[att]
    list_of_paragraphs.append(mydict)

PDM_XML_serializer(example)
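
The './p/p' path above is tied to the nesting depth of this particular document. As a hedged variation, assuming you only want paragraphs that do not contain another p, the predicate not(p) selects them at any depth, and normalize-space(string(.)) can then be applied safely because there is no nested paragraph text to pull in. The constant XML_ID and the name leaf_paras are made up for this sketch, and the sample is a shortened copy of the XML above.

```python
from lxml import etree

# Qualified (Clark notation) name under which lxml stores xml:id
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

sample = '''<div n="0001" type="car" xml:id="_3a327f0002">
  <p xml:id="_3a327f0003">
    1. A car is
    <p xml:id="_3a327f0004"> - big, yellow and red;</p>
    <p xml:id="_3a327f0006"> - and also has <lb/>
      big seats.
    </p>
  </p>
</div>'''

tree = etree.fromstring(sample)

# p elements that have an xml:id and no p child, at any depth
leaf_paras = tree.xpath('.//p[@xml:id and not(p)]')

result = {
    para.get(XML_ID): para.xpath('normalize-space(string(.))')
    for para in leaf_paras
}
print(result)
```

The outer paragraph (_3a327f0003) is deliberately excluded here, since string(.) on it would also return the text of its nested paragraphs.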

score 0 · Answer 5 · answered Jun 06 '21 at 09:49

If these tags are just noise for you, you can simply remove them before reading the xml

XML_tree = etree.fromstring(example.replace('<lb/>', '').encode() , parser=parser)
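
String replacement only matches the exact spelling '<lb/>' (not '<lb />' or an lb with attributes). A sketch of an alternative that works on the parsed tree instead, assuming lxml: etree.strip_tags removes the lb tags and merges the text that followed them into the parent, so para.text then holds the whole sentence. The sample document is made up for this illustration.

```python
from lxml import etree

# Hypothetical sample with two spellings of the lb tag
sample = '''<div>
  <p xml:id="a"> - and also has <lb/>
     big seats <lb n="2" /> and doors.
  </p>
</div>'''

tree = etree.fromstring(sample)

# Remove every <lb> tag; the text after each tag is kept in the parent
etree.strip_tags(tree, "lb")

para = tree.find("p")
text = " ".join(para.text.split())  # collapse whitespace and line breaks
print(text)
```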

Christoph Weiss-Schaber
