I am using XPath to parse an XML file:
from lxml import etree
example='''<div n="0001" type="car" xml:id="_3a327f0002">
<p xml:id="_3a327f0003">
1. A car is
<p xml:id="_3a327f0004"> - big, yellow and red;</p>
<p xml:id="_3a327f0005"> - has a big motor;</p>
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
</p>
</div>'''
I want to serialize the above XML file in the following way:
{"_3a327f0003": "1. A car is",
 "_3a327f0004": "- big, yellow and red;",
 "_3a327f0005": "- has a big motor;",
 "_3a327f0006": "- and also has big seats."}
Basically, I want to extract the text and build a dictionary in which each piece of text is keyed by its xml:id. My code is as follows:
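def PDM_XML_serializer(xml_string):
    # Lenient parser: recover from malformed input and leave entities unresolved.
    parser = etree.XMLParser(resolve_entities=False, strip_cdata=False, recover=True, ns_clean=True)
    XML_tree = etree.fromstring(xml_string.encode(), parser=parser)
    # Select every <p> element that carries an xml:id attribute.
    all_paras = XML_tree.xpath('.//p[@xml:id]')
    list_of_paragraphs = []
    for para in all_paras:
        mydict = {}
        mydict['text'] = para.text
        for att in para.attrib:
            mykey = att
            # The xml:id key is namespace-qualified, so store it under a plain name.
            if 'id' in mykey:
                mykey = 'xmlid'
            mydict[mykey] = para.attrib[att]
        list_of_paragraphs.append(mydict)
    return list_of_paragraphs

PDM_XML_serializer(example)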
It works except for the fact that if I have a node like:
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
It will not extract the text that follows <lb/> ("big seats.").
How should I modify:
XML_tree.xpath('.//p[@xml:id]')
in order to get all the text from <p> to </p>?
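For context, here is a minimal sketch of why the text after <lb/> goes missing. lxml stores text that follows a child element on that child's .tail, not on the parent's .text. The one-paragraph sample and the variable names below are hypothetical, made up for illustration only.

```python
from lxml import etree

# Hypothetical one-paragraph sample mirroring the problematic node.
node = etree.fromstring('<p> - and also has <lb/>\nbig seats.\n</p>')

lb = node[0]        # the <lb/> child element
before = node.text  # text up to the first child: ' - and also has '
after = lb.tail     # text following <lb/>, stored on <lb/> itself

print(repr(before))
print(repr(after))
```

So para.text alone can only ever return the part before the first child element.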
EDIT: para.itertext() could be used, but then the first node would also return all the text of the nodes nested inside it.
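One possible way around this, sketched under assumptions: instead of itertext(), evaluate ./text() relative to each paragraph. That selects only the text nodes that are direct children of that paragraph (including the text after <lb/>), and skips text belonging to nested <p> elements. The helper name own_text_by_id and the whitespace normalisation are assumptions, not part of the original code.

```python
from lxml import etree

example = '''<div n="0001" type="car" xml:id="_3a327f0002">
<p xml:id="_3a327f0003">
1. A car is
<p xml:id="_3a327f0004"> - big, yellow and red;</p>
<p xml:id="_3a327f0005"> - has a big motor;</p>
<p xml:id="_3a327f0006"> - and also has <lb/>
big seats.
</p>
</p>
</div>'''

# Namespace-qualified (Clark notation) name of the xml:id attribute.
XML_ID = '{http://www.w3.org/XML/1998/namespace}id'


def own_text_by_id(xml_string):
    """Hypothetical helper: map each <p>'s xml:id to its own direct text."""
    root = etree.fromstring(xml_string.encode())
    result = {}
    for para in root.iter('p'):
        if XML_ID not in para.attrib:
            continue
        # Direct text-node children only; nested <p> text is not included.
        pieces = para.xpath('./text()')
        # Collapse runs of whitespace and newlines into single spaces.
        result[para.get(XML_ID)] = ' '.join(' '.join(pieces).split())
    return result


for key, text in own_text_by_id(example).items():
    print(key, '->', text)
```

Note that this keeps the trailing punctuation from the source, and that text inside inline child elements other than <lb/> would be skipped as well, so it only fits markup shaped like the example.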