I know this question is quite common, but my example below is a bit more complex than the title of the question suggests.
Suppose I've got the following "test.xml" file:
<?xml version="1.0" encoding="UTF-8"?>
<test:xml xmlns:test="http://com/whatever/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <parent xsi:type="parentType">
        <child xsi:type="childtype">
            <grandchild>
                <greatgrandchildone>greatgrandchildone</greatgrandchildone>
                <greatgrandchildtwo>greatgrandchildtwo</greatgrandchildtwo>
            </grandchild><!--random comment -->
        </child>
        <child xsi:type="childtype">
            <greatgrandchildthree>greatgrandchildthree</greatgrandchildthree>
            <greatgrandchildfour>greatgrandchildfour</greatgrandchildfour><!--another random comment -->
        </child>
        <child xsi:type="childtype">
            <greatgrandchildthree>greatgrandchildthree</greatgrandchildthree>
            <greatgrandchildfour>greatgrandchildfour</greatgrandchildfour><!--third random comment -->
        </child>
    </parent>
</test:xml>
The program below does two main things:
- Finds all the nodes in the XML that have a "type" attribute
- Loops through each node of the XML and checks whether it is a descendant of an element that has a "type" attribute
This is my code:
from lxml import etree
import re

xmlDoc = etree.parse("test.xml")
root = xmlDoc.getroot()
nsmap = {
    'xsi': 'http://www.w3.org/2001/XMLSchema-instance'
}
nodesWithType = []

def check_type_in_path(nodesWithType, path, root):
    typesInPath = []
    elementType = ""
    for node in nodesWithType:
        print("checking node: ", node, " and path: ", path)
        if re.search(r"\b{}\b".format(
                node), path, re.IGNORECASE) is not None:
            element = root.find('.//{0}'.format(node))
            elementType = element.attrib.get(f"{{{nsmap['xsi']}}}type")
            if elementType is not None:
                print("found an element for this path. adding to list")
                typesInPath.append(elementType)
        else:
            print("element: ", node, " not found in path: ", path)
    print("path ", path ," has types: ", elementType)
    print("-------------------")
    return typesInPath

def get_all_node_types(xmlDoc):
    nodesWithType = []
    root = xmlDoc.getroot()
    for node in xmlDoc.iter():
        path = "/".join(xmlDoc.getpath(node).strip("/").split('/')[1:])
        if "COMMENT" not in path.upper():
            element = root.find('.//{0}'.format(path))
            elementType = element.attrib.get(f"{{{nsmap['xsi']}}}type")
            if elementType is not None:
                nodesWithType.append(path)
    return nodesWithType

nodesWithType = get_all_node_types(xmlDoc)
print("nodesWithType: ", nodesWithType)
for node in xmlDoc.xpath('//*'):
    path = "/".join(xmlDoc.getpath(node).strip("/").split('/')[1:])
    typesInPath = check_type_in_path(nodesWithType, path, root)
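For comparison, the two steps listed earlier can probably also be expressed without any string matching on paths, by walking from a node up through its ancestors. The following is only a minimal sketch under some assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies, it works on a trimmed inline copy of test.xml, and the names XSI_TYPE, parent_of and types_in_path are hypothetical helpers that do not appear in the code above.

```python
import xml.etree.ElementTree as ET

XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Trimmed inline copy of test.xml so the sketch runs on its own.
XML = """<test:xml xmlns:test="http://com/whatever/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <parent xsi:type="parentType">
    <child xsi:type="childtype">
      <grandchild>
        <greatgrandchildone>greatgrandchildone</greatgrandchildone>
      </grandchild>
    </child>
    <child xsi:type="childtype">
      <greatgrandchildfour>greatgrandchildfour</greatgrandchildfour>
    </child>
  </parent>
</test:xml>"""

root = ET.fromstring(XML)

# ElementTree elements do not know their parent, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def types_in_path(node):
    """Collect xsi:type values from the node and its ancestors, outermost first."""
    types = []
    while node is not None:
        value = node.get(XSI_TYPE)
        if value is not None:
            types.append(value)
        node = parent_of.get(node)  # None once the root has been passed
    return types[::-1]


# Step 1: every element that carries an xsi:type attribute.
typed_tags = [el.tag for el in root.iter() if el.get(XSI_TYPE) is not None]

# Step 2: the types seen on the way from the root down to one node.
target = root.find(".//greatgrandchildfour")
print(typed_tags)
print(types_in_path(target))
```

With lxml, the parent map should not be needed, since elements there expose getparent() and iterancestors(); the loop above would then walk those instead.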
My code should return all the types found along a given path. For example, consider the path parent/child[3]/greatgrandchildfour. This node is a descendant (direct or distant) of two nodes that have a "type" attribute: parent and parent/child[3]. I would therefore expect the typesInPath list for that particular node to include both "parentType" and "childtype".
However, based on the prints below, the typesInPath list for this node only includes "parentType" and not "childtype". The core of this logic is checking whether the path of the node with the type is contained in the path of the node in question (hence the regex search for an exact match of the string). But this is clearly not working. I'm not sure whether it's because the positional indexes in square brackets (such as [3]) prevent the condition from matching, or whether it's something else.
For the above example, the printed output is:
checking node: parent and path: parent/child[3]/greatgrandchildfour
found an element for this path. adding to list
checking node: parent/child[1] and path: parent/child[3]/greatgrandchildfour
element: parent/child[1] not found in path: parent/child[3]/greatgrandchildfour
checking node: parent/child[2] and path: parent/child[3]/greatgrandchildfour
element: parent/child[2] not found in path: parent/child[3]/greatgrandchildfour
checking node: parent/child[3] and path: parent/child[3]/greatgrandchildfour
element: parent/child[3] not found in path: parent/child[3]/greatgrandchildfour
path parent/child[3]/greatgrandchildfour has types: parentType
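One way to probe the suspicion about the square brackets is to run the same regex in isolation. The sketch below reuses the path and node strings from the output above; is_under is a hypothetical helper added only for illustration, and the plain prefix comparison it uses is one possible alternative to the regex, not necessarily the best one.

```python
import re

path = "parent/child[3]/greatgrandchildfour"
node = "parent/child[3]"

# As written in the question: "[3]" is read as a character class that matches
# the single character "3", so the pattern looks for the text "parent/child3".
raw_match = re.search(r"\b{}\b".format(node), path, re.IGNORECASE)

# Escaping makes the brackets literal, but the trailing \b still fails:
# "]" and "/" are both non-word characters, so no word boundary lies between them.
escaped_match = re.search(r"\b{}\b".format(re.escape(node)), path, re.IGNORECASE)


def is_under(node_path, full_path):
    """Hypothetical helper: True if full_path equals node_path or lies below it."""
    return full_path == node_path or full_path.startswith(node_path + "/")


print(raw_match, escaped_match, is_under(node, path))
```

Appending "/" before the prefix comparison keeps a node such as parent/child[1] from being treated as a prefix of parent/child[10].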