BeautifulSoup cannot find an HTML tag that contains certain text

Question

I want to use a regex to retrieve all elements that contain "//" in my HTML string. I followed the answer to this question: Using BeautifulSoup to find a HTML tag that contains certain text

I then wrote similar code:

from bs4 import BeautifulSoup
import re

html_text = \
"""
<html>
    <!--&lt;![endif]-->
    <head>
        <link rel="stylesheet" href="//abc.com/xyz" />
        <meta rel="stylesheet" href="//foo.com/bar" />
    </head>
</html>
"""

soup = BeautifulSoup(html_text)

for elem in soup(text=re.compile(r'//')):
    print(elem)

I expected output like:

//abc.com/xyz
//foo.com/bar

But I get nothing. I don't know why their test case works and mine doesn't. Is there an error in my script, or did I miss something?

In their example they search the `text` content of their tags; your values are defined in `href` attributes. Try substituting `text` with `href` (i.e. `soup(href=re.compile(r"//"))`). – zwer, Jul 06 '17 at 10:49

M. Leung · Accepted Answer · 2017-07-06T11:22:33.043

2

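The filter targets the wrong attribute. `text` matches the string content of tags, but the URLs are in `href` attributes, so filter on `href` instead: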

soup = BeautifulSoup(html_text, 'lxml')

# Match every tag whose href attribute contains "//".
for elem in soup(href=re.compile(r'//')):
    print(elem.get('href'))
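To make the difference between the two filters concrete, here is a minimal, self-contained sketch that runs both against the question's markup (without the conditional comment). It assumes the `bs4` package is installed and uses the standard-library `html.parser` backend rather than `lxml`, purely so it runs without extra dependencies; `string=` is the current name for the older `text=` argument.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<html>
    <head>
        <link rel="stylesheet" href="//abc.com/xyz" />
        <meta rel="stylesheet" href="//foo.com/bar" />
    </head>
</html>
"""

# Assumption: "html.parser" is used here instead of "lxml" to avoid an extra install.
soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r"//")

# string= (text= in older releases) is tested against text nodes, never attributes.
text_matches = soup.find_all(string=pattern)

# href= is tested against the value of each tag's href attribute.
href_matches = [tag["href"] for tag in soup.find_all(href=pattern)]

print(text_matches)   # no text node contains "//"
print(href_matches)   # the two href values
```

The first search comes back empty because the only text nodes in this document are whitespace, while the second returns `//abc.com/xyz` and `//foo.com/bar`.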

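For the follow-up question in the comments (the attribute is not always `href`), you can pass your own filter function to `find_all()`. It finds the tags that have "//" in any attribute value; you still need to extract the matching values from each tag afterwards.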

def has_requires_chars(tag):
    # Collect every attribute value of the tag into one flat list.
    value_list = []
    for avalue in tag.attrs.values():
        if isinstance(avalue, list):
            # Multi-valued attributes such as rel or class are parsed as lists.
            value_list.extend(avalue)
        else:
            value_list.append(avalue)
    # Keep the tag if any attribute value contains "//".
    return any("//" in value for value in value_list)

soup = BeautifulSoup(html_text, 'lxml')
for elem in soup.find_all(has_requires_chars):
    print(elem)
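The filter above returns whole tags, so the matching values still have to be pulled out. One possible way to do that is sketched below. The helper name `attrs_containing` and the `<script src=...>` sample tag are hypothetical, added only for illustration, and the sketch again assumes `bs4` with the standard-library `html.parser` backend.

```python
from bs4 import BeautifulSoup

def attrs_containing(soup, needle):
    """Hypothetical helper: list (tag name, attribute name, value) for values containing needle."""
    found = []
    for tag in soup.find_all(True):  # True matches every tag
        for name, value in tag.attrs.items():
            # Multi-valued attributes (e.g. class, rel) are parsed as lists.
            values = value if isinstance(value, list) else [value]
            for v in values:
                if needle in v:
                    found.append((tag.name, name, v))
    return found

# Sample markup; the script tag is an invented example of a non-href attribute.
html_text = """
<html>
    <head>
        <link rel="stylesheet" href="//abc.com/xyz" />
        <script src="//foo.com/bar.js"></script>
    </head>
</html>
"""

soup = BeautifulSoup(html_text, "html.parser")
for tag_name, attr_name, value in attrs_containing(soup, "//"):
    print(tag_name, attr_name, value)
```

For this sample it reports the `href` of the `link` tag and the `src` of the `script` tag, so the attribute name no longer needs to be known in advance.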

edited Jul 06 '17 at 11:22

answered Jul 06 '17 at 10:47

M. Leung


Thanks a lot, it works. But what if the attribute is not always "href"? How can we capture those cases as well? – Blurie Jul 06 '17 at 10:51
You can write your own function and pass it to `find_all()` if no built-in filter suits your case. – M. Leung Jul 06 '17 at 11:19
