Find All Elements Given Namespaced Attribute

Question

If I have something like this:

<p>blah</p>
<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p>

How would I get BeautifulSoup to select elements that have an attribute in the foo namespace?

E.g. I would like the 2nd and 3rd p elements returned.

Inbar Rose · Answer 1 · 2013-03-06T10:02:47.323

From the documentation:

Beautiful Soup provides a special argument called attrs which you can use in these situations. attrs is a dictionary that acts just like the keyword arguments:

soup.findAll(id=re.compile("para$"))
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]

soup.findAll(attrs={'id' : re.compile("para$")})
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]

You can use attrs if you need to put restrictions on attributes whose names are Python reserved words, like class, for, or import; or attributes whose names are non-keyword arguments to the Beautiful Soup search methods: name, recursive, limit, text, or attrs itself.

from BeautifulSoup import BeautifulStoneSoup
xml = '<person name="Bob"><parent rel="mother" name="Alice">'
xmlSoup = BeautifulStoneSoup(xml)

xmlSoup.findAll(name="Alice")
# []

xmlSoup.findAll(attrs={"name" : "Alice"})
# [<parent rel="mother" name="Alice"></parent>]

So for your given example:

soup.findAll(attrs={ "foo" : re.compile(".*") })
# or
soup.findAll(attrs={ re.compile("foo:.*") : re.compile(".*") })
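
As far as I can tell, the keys of attrs are looked up as literal attribute names, so a pattern used as a key (or the bare name "foo") may not match foo:bar. Here is a minimal sketch of an alternative, assuming Beautiful Soup 4 (the bs4 package) and its built-in html.parser. The helper name has_foo_attr is made up for illustration:

```python
from bs4 import BeautifulSoup

html = '''<p>blah</p>
<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p>'''

soup = BeautifulSoup(html, "html.parser")


def has_foo_attr(tag):
    # Hypothetical helper: in bs4, tag.attrs is a dict whose keys keep
    # the prefix verbatim, e.g. "foo:bar".
    return any(name.startswith("foo:") for name in tag.attrs)


# find_all accepts a function and keeps the tags for which it returns True.
matches = soup.find_all(has_foo_attr)
for tag in matches:
    print(tag)
```

Passing a function sidesteps the question of how attrs keys are matched, because the check runs against each tag's attribute names directly.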

unutbu · Answer 2 · 2013-03-06T11:03:31.723

BeautifulSoup (both version 3 and 4) does not appear to treat the namespace prefix as anything special. It just treats the prefix and the attribute name together as a single attribute name that happens to contain a colon.

So to find all <p> elements with attributes in the foo namespace, you just have to loop through all the attribute keys and check if attr.startswith('foo:'):

import bs4 as bs  # find_all and dict-style attrs are the version 4 API
content = '''\
<p>blah</p>
<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p>'''

soup = bs.BeautifulSoup(content, 'html.parser')
for p in soup.find_all('p'):
    # attribute names keep their prefix, e.g. 'foo:bar'
    for attr in p.attrs.keys():
        if attr.startswith('foo:'):
            print(p)
            break

yields

<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p>

With lxml you can search by XPath, which does have syntax support for searching for attributes by namespace:

import lxml.etree as ET
content = '''\
<root xmlns:foo="bar">
<p>blah</p>
<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p></root>'''

root = ET.XML(content)
for p in root.xpath('p[@foo:*]', namespaces={'foo':'bar'}):
    print(ET.tostring(p, encoding='unicode', with_tail=False))

yields

<p xmlns:foo="bar" foo:bar="something">blah</p>
<p xmlns:foo="bar" foo:xxx="something">blah</p>
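
If lxml is not available, a similar check can be sketched with the standard library's xml.etree.ElementTree. This is a swapped-in technique, not XPath: ElementTree stores a namespaced attribute under a key of the form {uri}local, so the sketch filters on that key prefix. The names FOO_NS and matches are illustrative only:

```python
import xml.etree.ElementTree as ET

content = '''<root xmlns:foo="bar">
<p>blah</p>
<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p></root>'''

FOO_NS = "bar"  # the namespace URI bound to the foo prefix in this sample
root = ET.fromstring(content)

# Attribute keys look like "{bar}bar", so match on the "{uri}" prefix.
ns_prefix = "{%s}" % FOO_NS
matches = [
    p for p in root.iter("p")
    if any(key.startswith(ns_prefix) for key in p.attrib)
]

for p in matches:
    print(p.attrib)
```

Matching on the URI rather than the literal foo prefix means the check still works if the document binds the same namespace to a different prefix.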

Is it possible to match the start of an attribute name, so that any attribute beginning with foo: matches? I've edited my question. - John Jiang, Mar 06 '13 at 10:20
