
I know this question is quite common, but my example below is a bit more complex than the title of the question suggests.

Suppose I've got the following "test.xml" file:

<?xml version="1.0" encoding="UTF-8"?>
<test:xml xmlns:test="http://com/whatever/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <parent xsi:type="parentType">
    <child xsi:type="childtype">
      <grandchild>
        <greatgrandchildone>greatgrandchildone</greatgrandchildone>
        <greatgrandchildtwo>greatgrandchildtwo</greatgrandchildtwo>
      </grandchild><!--random comment -->
    </child>
    <child xsi:type="childtype">
      <greatgrandchildthree>greatgrandchildthree</greatgrandchildthree>
      <greatgrandchildfour>greatgrandchildfour</greatgrandchildfour><!--another random comment -->
    </child>
    <child xsi:type="childtype">
      <greatgrandchildthree>greatgrandchildthree</greatgrandchildthree>
      <greatgrandchildfour>greatgrandchildfour</greatgrandchildfour><!--third random comment -->
    </child>
  </parent>
</test:xml>

Within my program below, I'm doing two main things:

  1. Find all the nodes in the XML that have an xsi:type attribute
  2. Loop through each node of the XML and find out whether it is a descendant of an element that has an xsi:type attribute

This is my code:

from lxml import etree
import re

xmlDoc = etree.parse("test.xml")
root = xmlDoc.getroot()

nsmap = {
    'xsi': 'http://www.w3.org/2001/XMLSchema-instance'
}

nodesWithType = []

def check_type_in_path(nodesWithType, path, root):
    typesInPath = []
    elementType = ""

    for node in nodesWithType:
        print("checking node: ", node, " and path: ", path)

        if re.search(r"\b{}\b".format(
            node), path, re.IGNORECASE) is not None:

            element = root.find('.//{0}'.format(node))
            elementType = element.attrib.get(f"{{{nsmap['xsi']}}}type")
            if elementType is not None:
                print("found an element for this path. adding to list")
                typesInPath.append(elementType)
        else:
            print("element: ", node, " not found in path: ", path)

    print("path ", path ," has types: ", elementType)
    print("-------------------")
    return typesInPath

def get_all_node_types(xmlDoc):
    nodesWithType = []
    root = xmlDoc.getroot()

    for node in xmlDoc.iter():

        path = "/".join(xmlDoc.getpath(node).strip("/").split('/')[1:])

        if "COMMENT" not in path.upper():
            element = root.find('.//{0}'.format(path))
            elementType = element.attrib.get(f"{{{nsmap['xsi']}}}type")
            if elementType is not None:
                nodesWithType.append(path)

    return nodesWithType

nodesWithType = get_all_node_types(xmlDoc)
print("nodesWithType: ", nodesWithType)

for node in xmlDoc.xpath('//*'):
    path = "/".join(xmlDoc.getpath(node).strip("/").split('/')[1:])
    typesInPath = check_type_in_path(nodesWithType, path, root)

The code should return all the types found along a given path. For example, consider the path parent/child[3]/greatgrandchildfour. This node is a descendant (direct or indirect) of two nodes that have the xsi:type attribute: parent and parent/child[3]. I would therefore expect the typesInPath list for that particular node to include both "parentType" and "childtype".
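For reference, the expected result described above can be sketched by walking from a node up through its ancestors and collecting every xsi:type found on the way. This is only a minimal sketch, not the code from this question: it swaps lxml for the standard library's xml.etree.ElementTree (which has no parent pointer, so a parent map is built by hand), and the names SAMPLE, build_parent_map and ancestor_types are hypothetical. SAMPLE is a trimmed copy of test.xml without the comments.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xsi:type attribute.
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"

# Trimmed copy of test.xml (assumption: comments and the XML declaration omitted).
SAMPLE = """<test:xml xmlns:test="http://com/whatever/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <parent xsi:type="parentType">
    <child xsi:type="childtype">
      <grandchild>
        <greatgrandchildone>greatgrandchildone</greatgrandchildone>
      </grandchild>
    </child>
    <child xsi:type="childtype">
      <greatgrandchildthree>greatgrandchildthree</greatgrandchildthree>
    </child>
    <child xsi:type="childtype">
      <greatgrandchildfour>greatgrandchildfour</greatgrandchildfour>
    </child>
  </parent>
</test:xml>"""


def build_parent_map(root):
    """Map every element to its parent (ElementTree keeps no parent pointer)."""
    return {child: parent for parent in root.iter() for child in parent}


def ancestor_types(element, parent_of):
    """Collect xsi:type values on the element and its ancestors, outermost first."""
    types = []
    current = element
    while current is not None:
        xsi_type = current.get(XSI_TYPE)
        if xsi_type is not None:
            types.append(xsi_type)
        current = parent_of.get(current)  # None once the root is reached
    types.reverse()
    return types


root = ET.fromstring(SAMPLE)
parent_of = build_parent_map(root)
target = root.find("parent/child[3]/greatgrandchildfour")
print(ancestor_types(target, parent_of))
```

Because the walk uses the tree structure rather than path strings, it does not depend on how the paths are formatted.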

However, as the output below shows, the typesInPath list for this node only includes "parentType" and not "childtype". The logic checks whether the path of each typed node occurs as a whole-word match inside the path of the node in question. That check clearly fails for parent/child[3]. I'm not sure whether the index predicates (such as [3]) in the path are breaking the regular expression, or whether it's something else.
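The string check described above can be reproduced in isolation, without any XML. The sketch below uses only the standard re module and suggests two things: the brackets in child[3] are read as a regex character class, and even after re.escape the trailing \b cannot match between "]" and "/" because neither is a word character. The helper is_ancestor_or_self is a hypothetical alternative based on a plain prefix comparison. Unlike the original check it is case-sensitive, which seems acceptable given that XML names are case-sensitive.

```python
import re

node = "parent/child[3]"
path = "parent/child[3]/greatgrandchildfour"

# Original check: "[3]" is a character class here, so the pattern looks for
# the literal text "parent/child3", which never occurs in the path.
raw_match = re.search(r"\b{}\b".format(node), path, re.IGNORECASE)

# Escaping the brackets is not enough: \b needs a word character on exactly
# one side, but "]" and "/" are both non-word characters.
escaped_match = re.search(r"\b{}\b".format(re.escape(node)), path, re.IGNORECASE)


def is_ancestor_or_self(node_path, path):
    """True if node_path equals path or is a whole-segment prefix of it."""
    return path == node_path or path.startswith(node_path + "/")


candidates = ["parent", "parent/child[1]", "parent/child[2]", "parent/child[3]"]
matches = [c for c in candidates if is_ancestor_or_self(c, path)]

print(raw_match, escaped_match, matches)
```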

For the above example, the printed output is:

checking node:  parent  and path:  parent/child[3]/greatgrandchildfour
found an element for this path. adding to list
checking node:  parent/child[1]  and path:  parent/child[3]/greatgrandchildfour
element:  parent/child[1]  not found in path:  parent/child[3]/greatgrandchildfour
checking node:  parent/child[2]  and path:  parent/child[3]/greatgrandchildfour
element:  parent/child[2]  not found in path:  parent/child[3]/greatgrandchildfour
checking node:  parent/child[3]  and path:  parent/child[3]/greatgrandchildfour
element:  parent/child[3]  not found in path:  parent/child[3]/greatgrandchildfour
path  parent/child[3]/greatgrandchildfour  has types:  parentType
Adam
  • Hello again, Adam! I'm a little confused: given the xml in your question, what exactly is your desired output? – Jack Fleeting Mar 31 '20 at 15:45
  • Thanks @JackFleeting for checking my question again! I actually posted my issue in a slightly less complex way here: https://stackoverflow.com/q/60953466/3480297 – Adam Mar 31 '20 at 16:57
