73

I'm trying to get the elements in an HTML doc that contain the following pattern of text: #\S{11}

<h2> this is cool #12345678901 </h2>

So the element above would be matched by:

soup('h2',text=re.compile(r' #\S{11}'))

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

I'm able to get all the text that matches (see line above). But I want the parent element of the text to match, so I can use that as a starting point for traversing the document tree. In this case, I'd want all the h2 elements to return, not the text matches.

Ideas?

Charles Stewart
sotangochips
  • Actually, the h2 restriction is ignored according to the BeautifulSoup documentation: "If you use text, then any values you give for name and the keyword arguments are ignored." – Rabarberski Jun 25 '10 at 14:23
  • @Rabarberski Not sure what the situation was in 2010, but [by 2012](https://web.archive.org/web/20120427003845/http://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-text-argument) searches that use `text` (or `string`, which replaced it) no longer ignored the other restrictions – T.C. Proctor Jan 20 '18 at 20:11

3 Answers

86
from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)


for elem in soup(text=re.compile(r' #\S{11}')):
    print elem.parent

Prints:

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
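For readers on Python 3, here is a rough port of the same idea to Beautiful Soup 4. It is a sketch, not part of the original answer: it assumes the `bs4` package is installed, uses the `string=` keyword (the newer name for `text=`) and the built-in `html.parser`, and the names `parents` and `h2_parents` are made up for illustration. Because a string-only search does not look at tag names, the last step filters the parents down to `h2`.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# A string-only search yields NavigableString objects;
# .parent on each one is the tag that encloses the text.
parents = [s.parent for s in soup.find_all(string=pattern)]

# The string search ignores tag names, so keep only the <h2> parents.
h2_parents = [tag for tag in parents if tag.name == "h2"]

for tag in h2_parents:
    print(tag)
```

Each item in `h2_parents` is a regular `Tag`, so it can serve as the starting point for further traversal.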
nosklo
  • Thanks! It's confusing that it returned what looked like a list of unicode strings. I appreciate the help. – sotangochips May 14 '09 at 22:05
  • `.parent` was awesome! I never thought about it. Thanks @nosklo. +1 – Md. Mohsin Oct 24 '14 at 23:37
  • If you want to iterate over the search results right away, then a for loop is perfect. Otherwise, how about a list comprehension such as: [elem.parent for elem in soup(text=re.compile(r' #\S{11}'))] – peterb Sep 03 '16 at 06:25
  • @sotangochips Yeah at first it looks like it's returning a plain unicode string, but it's actually a NavigableString with a `.parent`. Had to use PyCharm's debugger to realise it was not a plain string. – José Tomás Tocino May 08 '18 at 18:51
24

BeautifulSoup search operations return [a list of] BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes available to you. Of these attributes, parent is favored over previous because of changes in BS4.

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
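The session above is from BeautifulSoup 3 on Python 2. As a hedged sketch of the same check under Beautiful Soup 4 (assuming the `bs4` package, the `string=` keyword and the built-in `html.parser`; the names `text_hit` and `tag_hit` are hypothetical), inspecting the result types shows which kind of object each search returns:

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup("<h2>this is cool #12345678901</h2>", "html.parser")
pattern = re.compile(r"cool")

# Search by string only: the result is the text node itself.
text_hit = soup.find(string=pattern)

# Search by tag name plus string: the result is the enclosing tag.
tag_hit = soup.find("h2", string=pattern)

print(isinstance(text_hit, NavigableString))
print(isinstance(tag_hit, Tag))

# The text node's parent is the same object as the tag found directly.
print(text_hit.parent is tag_hit)
```

Checking with `isinstance` is a little more direct than reading `__dict__`, since a `NavigableString` prints like an ordinary string.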
Bruno Bronosky
  • For me `soup.find('h2', text=pattern)` gives the tag directly, no need to call `.parent`. Also the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-string-argument) says that you can combine the `string` (`text` in previous versions) parameter with arguments that find tags. In this case BeautifulSoup will return the tag – robertspierre Jul 15 '17 at 12:21
5

With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

import re
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
soup('h2',text=re.compile(r' #\S{11}'))

returns [<h2> this is cool #12345678901 </h2>].

T.C. Proctor