
Is there any way to find a nonrecursive DOM subnode in Python using BeautifulSoup?

E.g. consider parsing a pom.xml file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

    <parent>
        <groupId>com.parent</groupId>
        <artifactId>parent</artifactId>
        <version>1.0-SNAPSHOT</version>
        <relativePath>../pom.xml</relativePath>
    </parent>

    <modelVersion>2.0.0</modelVersion>
    <groupId>com.parent.somemodule</groupId>
    <artifactId>some_module</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>Some Module</name>
    ...

If I want to get groupId at the top level (specifically project->groupId, not project->parent->groupId), I use:

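from bs4 import BeautifulSoup

with open(pom) as pomHandle:
    # An HTML parser lowercases tag names, hence "groupid" below
    soup = BeautifulSoup(pomHandle, "html.parser")

groupId = soup.groupid.text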

Unfortunately, that returns the first occurrence of groupId in the file regardless of its level in the hierarchy, which here is project->parent->groupId. What I actually want is a non-recursive find that looks ONLY at the direct children of a specific node, not at their descendants. Is there a way to do that in BeautifulSoup?

alecxe
amphibient
  • Why are you using BeautifulSoup (an HTML parser) to parse well-formed XML? Python has a perfectly good XML parser. – Jim Garrison Jan 15 '14 at 20:42
  • this is why: http://stackoverflow.com/questions/21146417/simple-dom-traversing-in-python-using-xml-etree-elementtree/21146487?noredirect=1#comment31827467_21146487 – amphibient Jan 15 '14 at 20:44
  • because i don't wanna have to deal with the namespace BS, which is apparently not ignorable – amphibient Jan 15 '14 at 20:44

1 Answer


You can restrict the search to the direct children of the "project" node by passing recursive=False to find():

groupId = soup.project.find('groupid', recursive=False).text
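To see the difference between the default search and recursive=False side by side, here is a minimal sketch. It assumes bs4 with the built-in "html.parser" (which lowercases tag names); the POM string is a trimmed-down stand-in for the file in the question, and the variable names are just for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the pom.xml in the question (assumption, not the real file)
POM = """
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <parent>
        <groupId>com.parent</groupId>
        <artifactId>parent</artifactId>
    </parent>
    <groupId>com.parent.somemodule</groupId>
    <artifactId>some_module</artifactId>
</project>
"""

# html.parser lowercases tag names, so we search for "groupid"
soup = BeautifulSoup(POM, "html.parser")

# Default search is recursive: first match in document order, at any depth
first_any_depth = soup.project.find("groupid").text

# recursive=False only looks at direct children of <project>
top_level = soup.project.find("groupid", recursive=False).text

# find_all accepts the same flag
top_level_all = [
    tag.text for tag in soup.project.find_all("groupid", recursive=False)
]

print(first_any_depth)
print(top_level)
```

The same flag works on any tag, so you can apply it at whichever node level you need, e.g. soup.project.parent.find(..., recursive=False) for the children of the "parent" node.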

Hope that helps.

alecxe