I'm trying to parse some data from an XML file using Python 3. This is my XML:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Created on Fri Sep 07 08:20:37 WAT 2018 with ROAMSMART IREG-360 // www.roam-smart.com -->
<tadig-raex-21:TADIGRAEXIR21 xmlns:tadig-raex-21="https://infocentre.gsm.org/TADIG-RAEX-IR21" xmlns:ns2="https://infocentre.gsm.org/TADIG-GEN">
    <tadig-raex-21:RAEXIR21FileHeader>
        <tadig-raex-21:FileCreationTimestamp>2018-01-08T15:42:21+01:00</tadig-raex-21:FileCreationTimestamp>
        <tadig-raex-21:FileType>IR.21</tadig-raex-21:FileType>
        <tadig-raex-21:SenderTADIG>DEMO</tadig-raex-21:SenderTADIG>
        <tadig-raex-21:PublishComment>Update</tadig-raex-21:PublishComment>
        <tadig-raex-21:TADIGGenSchemaVersion>2.4</tadig-raex-21:TADIGGenSchemaVersion>
        <tadig-raex-21:TADIGRAEXIR21SchemaVersion>10.1</tadig-raex-21:TADIGRAEXIR21SchemaVersion>
    </tadig-raex-21:RAEXIR21FileHeader>
    <tadig-raex-21:OrganisationInfo>
        <tadig-raex-21:OrganisationName>DEMO</tadig-raex-21:OrganisationName>
        <tadig-raex-21:CountryInitials>FRA</tadig-raex-21:CountryInitials>
        <tadig-raex-21:NetworkList>
            <tadig-raex-21:Network>
                <tadig-raex-21:TADIGCode>DEMO</tadig-raex-21:TADIGCode>
                <tadig-raex-21:NetworkType>Terrestrial</tadig-raex-21:NetworkType>
                <tadig-raex-21:NetworkData>
                    <tadig-raex-21:IPRoaming_IW_InfoSection>
                        <tadig-raex-21:IPRoaming_IW_Info_General>
                            <tadig-raex-21:EffectiveDateOfChange>2013-07-01</tadig-raex-21:EffectiveDateOfChange>
                            <tadig-raex-21:PMNAuthoritativeDNSIPList>
                                <tadig-raex-21:DNSitem>
                                    <tadig-raex-21:IPAddress>212.234.96.11</tadig-raex-21:IPAddress>
                                    <tadig-raex-21:DNSname>PMASDNS1.mnc001.mcc208.gprs</tadig-raex-21:DNSname>
                                </tadig-raex-21:DNSitem>
                                <tadig-raex-21:DNSitem>
                                    <tadig-raex-21:IPAddress>212.234.96.74</tadig-raex-21:IPAddress>
                                    <tadig-raex-21:DNSname>LYLADNS1.mnc001.mcc208.gprs</tadig-raex-21:DNSname>
                                </tadig-raex-21:DNSitem>
                                <tadig-raex-21:DNSitem>
                                    <tadig-raex-21:IPAddress>212.234.96.11</tadig-raex-21:IPAddress>
                                    <tadig-raex-21:DNSname>PMASDNS1.mnc001.mcc208.3gppnetwork.org</tadig-raex-21:DNSname>
                                </tadig-raex-21:DNSitem>
                                <tadig-raex-21:DNSitem>
                                    <tadig-raex-21:IPAddress>212.234.96.74</tadig-raex-21:IPAddress>
                                    <tadig-raex-21:DNSname>LYLADNS1.mnc001.mcc208.3gppnetwork.org</tadig-raex-21:DNSname>
                                </tadig-raex-21:DNSitem>
                            </tadig-raex-21:PMNAuthoritativeDNSIPList>
                        </tadig-raex-21:IPRoaming_IW_Info_General>
                    </tadig-raex-21:IPRoaming_IW_InfoSection>
                </tadig-raex-21:NetworkData>
                <tadig-raex-21:HostedNetworksInfo>
                    <tadig-raex-21:SectionNA>Section not applicable</tadig-raex-21:SectionNA>
                </tadig-raex-21:HostedNetworksInfo>
                <tadig-raex-21:PresentationOfCountryInitialsAndMNN>DEMO FR</tadig-raex-21:PresentationOfCountryInitialsAndMNN>
                <tadig-raex-21:AbbreviatedMNN>DEMO</tadig-raex-21:AbbreviatedMNN>
                <tadig-raex-21:NetworkColourCode>1</tadig-raex-21:NetworkColourCode>
            </tadig-raex-21:Network>
        </tadig-raex-21:NetworkList>
    </tadig-raex-21:OrganisationInfo>
</tadig-raex-21:TADIGRAEXIR21>

I need to get all IP addresses from all DNS items and save them to a list that will be exported to a CSV file. Each row should pair an IP address with the TADIG code.

I based my code on this question (Getting all instances of child node using xml.etree.ElementTree). Here is my code:

import csv
import os
from xml.etree import ElementTree as ET

out = csv.writer(open("result.csv", "w"), delimiter=',', quoting=csv.QUOTE_ALL)
# loop through the directory and parse all XML files
directory = "C:\\Users\\Walid Ben Chamekh\\PycharmProjects\\dnsparser\\com\\ir21\\dnsparser\\"

# start parsing
print("Start parsing")
for filename in os.listdir(directory):
    if filename.endswith(".xml"):
        print(filename)
        root = ET.parse(filename).getroot()
        # get Network TADIG code
        raexFileHeader = root.getchildren()[0]
        tadig = raexFileHeader.getchildren()[2].text

        try:
            DNS = root.findall(
                ".//tadig-raex-21:OrganisationInfo/tadig-raex-21:NetworkList/tadig-raex-21:Network["
                "1]/tadig-raex-21:NetworkData/tadig-raex-21:IPRoaming_IW_InfoSection/tadig-raex-21"
                ":IPRoaming_IW_Info_General/tadig-raex-21:PMNAuthoritativeDNSIPList")
        except Exception:
            print("no data")
            continue

        # get all IPs from all dns items
        for item in DNS.getchildren():
            IPresult = [tadig]
            ip = item.getchildren()[0].text
            IPresult.append(ip)
            print(IPresult)
            out.writerow(IPresult)
        continue
    else:
        continue
# End Parsing
print("End Parsing")

It does not work: the DNS list is always empty. Thank you for your help.

1 Answer

The problem is that ElementTree does not resolve namespace prefixes on its own. In calls to find(), findall() and iterfind() you need to pass a dict that maps each prefix used in the path to its namespace URI, as described in this answer: https://stackoverflow.com/a/14853417/2044940

namespaces = { "tadig-raex-21": "https://infocentre.gsm.org/TADIG-RAEX-IR21" }
root.findall("...", namespaces)
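
To see the effect in isolation, here is a minimal sketch that parses a trimmed-down copy of the header from an inline string. The SAMPLE constant and the variable names are made up for illustration; only the element names and the namespace URI come from the question.

```python
from xml.etree import ElementTree as ET

# Trimmed-down sample (assumption: just enough of the file to show the lookup).
SAMPLE = """
<tadig-raex-21:TADIGRAEXIR21 xmlns:tadig-raex-21="https://infocentre.gsm.org/TADIG-RAEX-IR21">
  <tadig-raex-21:RAEXIR21FileHeader>
    <tadig-raex-21:SenderTADIG>DEMO</tadig-raex-21:SenderTADIG>
  </tadig-raex-21:RAEXIR21FileHeader>
</tadig-raex-21:TADIGRAEXIR21>
""".strip()

root = ET.fromstring(SAMPLE)
header_path = "tadig-raex-21:RAEXIR21FileHeader/tadig-raex-21:SenderTADIG"

# Without a prefix map, the prefixed path cannot be resolved.
try:
    root.find(header_path)
    unresolved = False
except SyntaxError:
    unresolved = True

# With the prefix map, the same path finds the element.
namespaces = {"tadig-raex-21": "https://infocentre.gsm.org/TADIG-RAEX-IR21"}
tadig = root.find(header_path, namespaces).text
print(unresolved, tadig)
```

The key in the dict only has to match the prefix written in the path; it is the URI that has to match the document.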

With this and a few other changes, I was able to make it return the following data:

['DEMO', '212.234.96.11']
['DEMO', '212.234.96.74']
['DEMO', '212.234.96.11']
['DEMO', '212.234.96.74']

Here is the Python script. Note that filename is not defined in it; set it to the path of the input XML file first:

from xml.etree import ElementTree as ET

# Doesn't help, it is only used for serialization, i.e. writing XML, but not parsing
#ET.register_namespace("tadig-raex-21", "https://infocentre.gsm.org/TADIG-RAEX-IR21")

# Dictionary of namespaces, needed to avoid error:
# -> SyntaxError: prefix 'tadig-raex-21' not found in prefix map
namespaces = {
    "tadig-raex-21": "https://infocentre.gsm.org/TADIG-RAEX-IR21"
}

root = ET.parse(filename).getroot()

# Fetch SenderTADIG by path
# TODO: handle case if the element doesn't exist
tadig = root.find(
    "tadig-raex-21:RAEXIR21FileHeader/"
    "tadig-raex-21:SenderTADIG", namespaces).text

# Select DNSitems for further processing
DNS = root.findall(
    "tadig-raex-21:OrganisationInfo/"
    "tadig-raex-21:NetworkList/"
    "tadig-raex-21:Network[1]/"
    "tadig-raex-21:NetworkData/"
    "tadig-raex-21:IPRoaming_IW_InfoSection/"
    "tadig-raex-21:IPRoaming_IW_Info_General/"
    "tadig-raex-21:PMNAuthoritativeDNSIPList/"
    "tadig-raex-21:DNSitem", namespaces)

# DNS is a list of elements, can't call getchildren() on it directly!
# (getchildren() is also deprecated and was removed in Python 3.9)
for item in DNS:
    IPresult = [tadig]
    # It's safer to fetch the IPAddress via the element name
    ip = item.find("tadig-raex-21:IPAddress", namespaces).text
    IPresult.append(ip)
    print(IPresult)
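
The script above only prints the rows. A possible way to put the directory loop and the CSV export from the question back in is sketched below. The function names extract_rows and export_dns_ips are hypothetical, and the sketch assumes the layout from the question: one row per DNS item, with every field quoted.

```python
import csv
import os
from xml.etree import ElementTree as ET

NAMESPACES = {"tadig-raex-21": "https://infocentre.gsm.org/TADIG-RAEX-IR21"}

# Same element path as in the script above, joined from its steps.
DNS_ITEM_PATH = "/".join("tadig-raex-21:" + step for step in (
    "OrganisationInfo", "NetworkList", "Network[1]", "NetworkData",
    "IPRoaming_IW_InfoSection", "IPRoaming_IW_Info_General",
    "PMNAuthoritativeDNSIPList", "DNSitem"))


def extract_rows(xml_path):
    """Return [tadig, ip] pairs for one file (hypothetical helper)."""
    root = ET.parse(xml_path).getroot()
    # findtext() returns the default instead of failing on a missing element.
    tadig = root.findtext(
        "tadig-raex-21:RAEXIR21FileHeader/tadig-raex-21:SenderTADIG",
        default="", namespaces=NAMESPACES)
    rows = []
    for item in root.findall(DNS_ITEM_PATH, NAMESPACES):
        ip = item.findtext("tadig-raex-21:IPAddress", namespaces=NAMESPACES)
        if ip:
            rows.append([tadig, ip])
    return rows


def export_dns_ips(directory, csv_path):
    """Write the rows of every .xml file in directory to csv_path."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter=",", quoting=csv.QUOTE_ALL)
        for filename in sorted(os.listdir(directory)):
            if filename.endswith(".xml"):
                # os.listdir() returns bare names, so join them with the directory.
                writer.writerows(extract_rows(os.path.join(directory, filename)))
```

It would be called as export_dns_ips(directory, "result.csv"), with directory being the folder from the question. Joining the directory and the file name matters: ET.parse(filename) with a bare name only works when the script runs from inside that folder.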

It is possible without the namespace dict as well, but then the full namespace URI needs to be used in curly braces as a prefix (found here):

tadig = root.find(
  "{https://infocentre.gsm.org/TADIG-RAEX-IR21}RAEXIR21FileHeader/"
  "{https://infocentre.gsm.org/TADIG-RAEX-IR21}SenderTADIG").text

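If you go the curly-brace route, a small helper keeps the paths readable. This is only a sketch: the helper q() and the shortened inline sample are made up for illustration. The last lookup uses the {*} wildcard, which as far as I know requires Python 3.8 or newer.

```python
from xml.etree import ElementTree as ET

NS_URI = "https://infocentre.gsm.org/TADIG-RAEX-IR21"


def q(tag):
    """Hypothetical helper: qualify a tag name as {uri}tag."""
    return "{%s}%s" % (NS_URI, tag)


# Shortened sample; the prefix "t" is arbitrary, only the URI matters.
root = ET.fromstring(
    '<t:TADIGRAEXIR21 xmlns:t="%s">'
    "<t:RAEXIR21FileHeader><t:SenderTADIG>DEMO</t:SenderTADIG></t:RAEXIR21FileHeader>"
    "</t:TADIGRAEXIR21>" % NS_URI)

# Tags are stored in {uri}tag form, which is why the braces work in paths.
assert root.tag == q("TADIGRAEXIR21")

tadig = root.find(q("RAEXIR21FileHeader") + "/" + q("SenderTADIG")).text

# Python 3.8+ only: {*} matches the tag name in any namespace.
tadig_wildcard = root.find("{*}RAEXIR21FileHeader/{*}SenderTADIG").text
print(tadig, tadig_wildcard)
```
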
Interestingly, the namespace declarations on the root element don't show up among its attributes (which would otherwise let us generate the namespaces dict from them):

# Empty dict
ET.parse(filename).getroot().attrib

The root element carries namespace information:

<tadig-raex-21:TADIGRAEXIR21
   xmlns:tadig-raex-21="https://infocentre.gsm.org/TADIG-RAEX-IR21"
   xmlns:ns2="https://infocentre.gsm.org/TADIG-GEN">

getroot() takes no namespaces argument, so I don't know if or how the values of xmlns:tadig-raex-21 and xmlns:ns2 can be read from the parsed element.
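
One approach that does seem to work is to collect the declarations while the file is being parsed instead of reading them off the element afterwards: ET.iterparse() can report "start-ns" events, each carrying a (prefix, uri) pair. A minimal sketch, where collect_namespaces is a hypothetical helper name and the inline sample stands in for the real file:

```python
import io
from xml.etree import ElementTree as ET


def collect_namespaces(source):
    """Build a prefix -> URI dict from the declarations in source.

    source may be a file name or a binary file object (hypothetical helper).
    """
    namespaces = {}
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        namespaces[prefix] = uri
    return namespaces


# Stand-in for the real file: just the root element with its two declarations.
sample = io.BytesIO(
    b'<tadig-raex-21:TADIGRAEXIR21'
    b' xmlns:tadig-raex-21="https://infocentre.gsm.org/TADIG-RAEX-IR21"'
    b' xmlns:ns2="https://infocentre.gsm.org/TADIG-GEN"/>')

namespaces = collect_namespaces(sample)
print(namespaces)
```

The resulting dict can be passed to find() and findall() in place of the hand-written one. Note that this reads through the whole file once, so for large inputs it adds a second parsing pass.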

CodeManX
  • It actually works without the namespaces dict as well if you put the full namespace URI like this: `"{https://infocentre.gsm.org/TADIG-RAEX-IR21}RAEXIR21FileHeader/"`. It's kinda messy though. – CodeManX Sep 10 '18 at 16:22