
I'm trying to find a tag using xml.etree.ElementTree. I don't know its exact position, so I have to search for it.

The inputs are NuGet specifications (.nuspec files) for .NET NuGet packages.

I used this code to find the element but it doesn't find it:

import xml.etree.ElementTree as ET

content = ......

tree = ET.fromstring(content)

# none of the following lines are working
tag = tree.find('licenseUrl')
tags = tree.findall('*/licenseUrl')
tags = tree.findall('.//licenseUrl')
tags = tree.findall('licenseUrl')

But len(tags) is always 0.

If I use a regex to find it, it works like a charm:

re.search(r'<licenseUrl>(?P<url>.*?)</licenseUrl>', content, flags=re.DOTALL | re.MULTILINE)

But using regex to parse XML is not recommended.

What am I doing wrong?

DEMO that shows the working code.

I was using the following information without luck:

For completeness, here is the value of content:

<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
  <metadata>
    <id>AutoMapper</id>
    <version>9.0.0</version>
    <authors>Jimmy Bogard</authors>
    <owners>Jimmy Bogard</owners>
    <requireLicenseAcceptance>false</requireLicenseAcceptance>
    <licenseUrl>https://github.com/AutoMapper/AutoMapper/blob/master/LICENSE.txt</licenseUrl>
    <projectUrl>https://automapper.org/</projectUrl>
    <iconUrl>https://s3.amazonaws.com/automapper/icon.png</iconUrl>
    <description>A convention-based object-object mapper.</description>
    <repository type="git" url="https://github.com/AutoMapper/AutoMapper" commit="53faf3f014802b502f6a49b4c94368f478752f59" />
    <dependencies>
      <group targetFramework=".NETFramework4.6.1" />
      <group targetFramework=".NETStandard2.0">
        <dependency id="Microsoft.CSharp" version="4.5.0" exclude="Build,Analyzers" />
        <dependency id="System.Reflection.Emit" version="4.3.0" exclude="Build,Analyzers" />
      </group>
    </dependencies>
    <frameworkAssemblies>
      <frameworkAssembly assemblyName="Microsoft.CSharp" targetFramework=".NETFramework4.6.1" />
    </frameworkAssemblies>
  </metadata>
</package>
Sebastian Schumann
  • You are not taking the namespace into account. See https://docs.python.org/3/library/xml.etree.elementtree.html#parsing-xml-with-namespaces – mzjn Apr 23 '20 at 11:49
  • @mzjn Thx. Maybe that's the problem but how do I ignore them? I don't care about the namespace and I'm pretty sure that they differ between xmls of different versions. – Sebastian Schumann Apr 23 '20 at 11:52
  • There have been many questions about processing XML with namespaces. See for example https://stackoverflow.com/a/61154644/407651 – mzjn Apr 23 '20 at 11:55
  • @mzjn Okay. Support was added in 3.8. I've to test it. – Sebastian Schumann Apr 23 '20 at 12:01
  • Trilliput's answer is OK, but IMHO this question should be closed as a duplicate. So many similar questions have already been asked. More examples: https://stackoverflow.com/q/20435500/407651, https://stackoverflow.com/q/14853243/407651 – mzjn Apr 23 '20 at 12:07
  • @mzjn Thx. The support for `.//{*}xxx` works. – Sebastian Schumann Apr 23 '20 at 12:10

1 Answer

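Your XML declares a default namespace, which you are not taking into account. Every element in the document belongs to that namespace, so an unqualified name such as licenseUrl never matches. This code should work: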

import xml.etree.ElementTree as ET

content = ......

tree = ET.fromstring(content)
ns = {'ms': 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'}
tags = tree.findall('.//ms:licenseUrl', ns)
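
To make the cause visible, here is a minimal self-contained sketch. It assumes a trimmed-down copy of the .nuspec from the question (the XML declaration and most metadata are left out for brevity), and the variable names are only illustrative. It shows that ElementTree stores each tag as {namespace-uri}localname, which is why the unqualified search comes back empty while the prefixed one succeeds:

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample based on the .nuspec shown in the question (assumption:
# only the elements needed for the demonstration are kept).
content = """<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
  <metadata>
    <id>AutoMapper</id>
    <licenseUrl>https://github.com/AutoMapper/AutoMapper/blob/master/LICENSE.txt</licenseUrl>
  </metadata>
</package>"""

root = ET.fromstring(content)

# The default namespace becomes part of every tag name.
root_tag = root.tag

# An unqualified name does not match a namespaced element.
unqualified = root.findall('.//licenseUrl')

# Map an arbitrary prefix ('ms' here) to the namespace URI and use it in the path.
ns = {'ms': 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'}
qualified = root.findall('.//ms:licenseUrl', ns)
license_url = qualified[0].text

print(root_tag)
print(len(unqualified), len(qualified))
print(license_url)
```

The prefix in the ns mapping is local to your code; it does not have to appear anywhere in the XML itself.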

UPDATE: Or, as @mzjn mentioned in the comments, use the {*} namespace wildcard (supported since Python 3.8) if you really don't care about the namespace:

import xml.etree.ElementTree as ET

content = ......

tree = ET.fromstring(content)
tags = tree.findall('.//{*}licenseUrl')
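
If you are stuck on a Python version older than 3.8, or the schema URI differs between .nuspec versions, one possible workaround is to read the namespace off the root element instead of hard-coding it. The following is only a sketch: namespace_of and find_license_urls are hypothetical helper names, the sample documents are invented for illustration, and it assumes that licenseUrl lives in the same namespace as the root element.

```python
import xml.etree.ElementTree as ET


def namespace_of(element):
    """Return the namespace URI embedded in element.tag, or '' if there is none."""
    if element.tag.startswith('{'):
        return element.tag[1:].split('}', 1)[0]
    return ''


def find_license_urls(xml_text):
    """Return the text of every licenseUrl element in the root's namespace."""
    root = ET.fromstring(xml_text)
    uri = namespace_of(root)
    # Build a fully qualified path such as './/{uri}licenseUrl'.
    path = './/{%s}licenseUrl' % uri if uri else './/licenseUrl'
    return [el.text for el in root.findall(path)]


# Two illustrative documents that differ only in their schema URI
# (both URIs are assumptions made for this example).
TEMPLATE = """<package xmlns="%s">
  <metadata>
    <licenseUrl>https://example.invalid/LICENSE.txt</licenseUrl>
  </metadata>
</package>"""

newer = TEMPLATE % 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'
older = TEMPLATE % 'http://schemas.microsoft.com/packaging/2010/07/nuspec.xsd'

print(find_license_urls(newer))
print(find_license_urls(older))
```

On Python 3.8 and later, the {*} wildcard above is shorter and does not depend on the root element's namespace.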
Shmygol