I have XML files like this one:
<bs-submission participant-id="tagger1" run-id="first annotations with the prospectus tagger" task="book-toc" toc-creation="manual" toc-source="full-content">
<source-files pdf="yes" xml="no"/>
<description>
This file contains the **manual** annotations of tagger1 for one single prospectus
</description>
<book>
<bookid>35A6A54497295928</bookid>
<toc-section page="13"/>
<toc-section page="14"/>
<toc-section page="15"/>
<toc-section page="16"/>
<toc-entry title="ESSAY I. On RIDICULE considered as a Test of Truth." page="17">
<toc-entry title="I. VINDICATION of the noble Writer's Zeal for Freedom." page="17"/>
<toc-entry title="II. Of bis Method of treating the Question concerning Ridicule." page="23"/>
<toc-entry title="III. Of the different Kinds of Composition; Poetry, Eloquence, and Argument." page="28"/>
<toc-entry title="IV. That Ridicule is a Species of Eloquence." page="57"/>
<toc-entry title="V. A Confirmation of the foregoing Truths by an Appeal to Fact." page="64"/>
<toc-entry title="VI. Of the noble Writer's Arguments in support of his new Theory; particularly the Case of SOCRATES." page="70"/>
<toc-entry title="VII. His" page="80"/>
<toc-entry title="VII. His further Reasonings examined." page="80"/>
<toc-entry title="VIII. Of his main Argument; relating to Protestantism and Christianity" page="90"/>
<toc-entry title="IX. Of the Opinion of GORGIAS quoted by his Lordship from ARISTOTLE." page="97"/>
<toc-entry title="X. The Reasoning of one of his Followers in this Subject, examined." page="104"/>
<toc-entry title="XI. Of the particular Impropriety of applying Ridicule to the Investigation of religious Truth." page="115"/>
</toc-entry>
<toc-entry title="ESSAY II. On the Motives to VIRTUE, and the Necessity of Religious Principle." page="125">
<toc-entry title="I. Introduction." page="125"/>
<toc-entry title="II. That the Definitions which Lord SHAFTESBURY, and several other Moralists have given of Virtue, are inadequate and defective." page="127"/>
<toc-entry title="III. Of the real Nature of Virtue." page="139"/>
<toc-entry title="IV. Of" page="153"/>
<toc-entry title="IV. Of an Objection urged by Dr. MAN-DEVILLE against the permanent Reality of Virtue." page="153"/>
<toc-entry title="V. Examination and Analysis of The Fable of the Bees." page="162"/>
<toc-entry title="VI. Of the natural Motives to virtuous Action." page="174"/>
<toc-entry title="VII. How far these Motives can in Reality influence all Mankind. The Errors of the Stoic and Epicurean Parties; and the most probable Foundation of these Errors." page="183"/>
<toc-entry title="VIII. The noble Writer's additional Reasonings examined; and shown to be without Foundation." page="203"/>
<toc-entry title="IX. That the religious Principle, or Obedience to the Will of God, can alone produce a uniform and permanent Motive to Virtue. The noble Writer's Objections examined." page="222"/>
<toc-entry title="X. Of the Efficacy of the religious Principle. Conclusion." page="239"/>
</toc-entry>
<toc-entry title="ESSAY III. On Revealed RELIGION, and CHRISTIANITY." page="257">
<toc-entry title="I. Of the noble Writer's Manner of treating Christianity." page="257"/>
<toc-entry title="II. Of his Objections to the Truths of natural Religion." page="261"/>
<toc-entry title="III. Of the Credibility of the Gospel-History." page="272"/>
<toc-entry title="IV. Of the Scripture-Miracles" page="287"/>
<toc-entry title="V. Of Enthusiasm." page="310"/>
<toc-entry title="VI. Of the religious and moral Doctrines of Christianity." page="330"/>
<toc-entry title="VII. Of several detached Passages in the Characteristics." page="365"/>
<toc-entry title="VIII. Of the Style and Composition of the Scriptures." page="387"/>
<toc-entry title="IX. Of the noble Writer's Treatment of the English Clergy." page="407"/>
</toc-entry>
</book>
</bs-submission>
As you can see, the toc-entry elements are hierarchical. In some XML files, they go up to 5 or 6 levels deep.
I would like to write a function that takes such content as input and outputs the same file while keeping only the toc-entry elements whose depth is less than or equal to a specified integer max_depth.
I am using two libraries which I found useful for handling XML files:
xml.etree.ElementTree
lxml.html
I first used only xml.etree.ElementTree, but when I wanted to compute the depth of a toc-entry, I found a function that uses the second library, so I started using it too.
The function computing the depth is the following (node is an lxml.html element):
def depth(node):
    '''
    taken from:
    https://stackoverflow.com/questions/17275524/xml-etree-elementtree-get-node-depth
    '''
    d = 0
    while node is not None:
        d += 1
        node = node.getparent()
    # return d
    return d - 4
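For comparison, the same kind of count can be sketched with xml.etree.ElementTree alone, which has no getparent(). This is only a sketch under assumptions: the helper name toc_depth, the parent_map lookup, and the convention that a top-level toc-entry has depth 1 are mine, and the small document is made up for illustration. It counts only toc-entry ancestors, so it does not rely on a fixed offset such as `- 4`.

```python
import xml.etree.ElementTree as ET

def toc_depth(node, parent_map):
    """Depth of a toc-entry: itself plus its toc-entry ancestors (assumed convention)."""
    d = 0
    while node is not None:
        if node.tag == 'toc-entry':
            d += 1
        node = parent_map.get(node)  # None once the root has been passed
    return d

# Small hypothetical document, not the real file
book = ET.fromstring(
    '<book>'
    '<toc-entry title="A" page="1">'
    '<toc-entry title="A.1" page="2"/>'
    '</toc-entry>'
    '</book>'
)

# ElementTree keeps no parent pointers, so build a child -> parent lookup once
parent_map = {child: parent for parent in book.iter() for child in parent}

depths = {e.get('title'): toc_depth(e, parent_map) for e in book.iter('toc-entry')}
print(depths)
```

The parent map has to be rebuilt if the tree is modified afterwards, since it is a snapshot of the structure at the time it was created.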
Here is an attempt to write the function I need (xml.etree.ElementTree is imported as ET):
# book is a ET.Element
book_lxml = lxml.html.fromstring(ET.tostring(book))
for toc_entry in book_lxml.iter('toc-entry'):
    if depth(toc_entry) > max_depth:
        try:
            toc_entry.getparent().remove(toc_entry)
            print("removed")
        except ValueError:
            print("kept")
    else:
        print("kept")
root.append(ET.fromstring(lxml.html.tostring(book_lxml)))
# root is a ET.Element containing the bs-submission span
The problem with this is that the hierarchy gets messed up: some toc entries with depth <= max_depth end up at a different depth in the output than they had in the input.
Any ideas on how to properly output an XML file identical to the input, except that only toc-entry elements with depth <= max_depth are kept?
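To make the goal concrete, here is a minimal sketch of the behaviour I am after, written with xml.etree.ElementTree only. It tracks the depth during a recursive walk instead of calling getparent(), and it iterates over a copy of each child list so that removing an element does not skip its siblings. The names prune_toc and limit_toc_depth, the convention that a toc-entry directly under book has depth 1, and the small sample document are all assumptions made for illustration, not part of my real code.

```python
import copy
import xml.etree.ElementTree as ET

def prune_toc(element, max_depth, depth=0):
    """Remove, in place, every toc-entry nested deeper than max_depth."""
    # list(element) is a snapshot, so removing children is safe while looping
    for child in list(element):
        if child.tag == 'toc-entry':
            if depth + 1 > max_depth:
                element.remove(child)  # drops the entry and its whole subtree
            else:
                prune_toc(child, max_depth, depth + 1)
        else:
            # Other elements (book, toc-section, ...) do not add to the depth
            prune_toc(child, max_depth, depth)

def limit_toc_depth(root, max_depth):
    """Return a pruned deep copy, leaving the input tree untouched."""
    pruned = copy.deepcopy(root)
    prune_toc(pruned, max_depth)
    return pruned

# Small hypothetical document shaped like the real one
sample = ET.fromstring(
    '<bs-submission><book>'
    '<bookid>X</bookid>'
    '<toc-section page="1"/>'
    '<toc-entry title="A" page="2">'
    '<toc-entry title="A.1" page="3">'
    '<toc-entry title="A.1.a" page="4"/>'
    '</toc-entry>'
    '</toc-entry>'
    '<toc-entry title="B" page="5"/>'
    '</book></bs-submission>'
)

result = limit_toc_depth(sample, 2)
titles = [e.get('title') for e in result.iter('toc-entry')]
print(titles)
print(ET.tostring(result, encoding='unicode'))
```

Because the tree is never serialized and re-parsed as HTML, the kept entries stay attached to the same parents they had in the input.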