I want to use a regex to retrieve all elements that contain "//" in my HTML string, and I followed the answer to this question: Using BeautifulSoup to find a HTML tag that contains certain text

Then I wrote something similar:

from BeautifulSoup import BeautifulSoup
import re

html_text = \
"""
<html>
    <!--&lt;![endif]-->
    <head>
        <link rel="stylesheet" href="//abc.com/xyz" />
        <meta rel="stylesheet" href="//foo.com/bar" />
    </head>
</html>
"""

soup = BeautifulSoup(html_text)

for elem in soup(text=re.compile(r'//')):
    print elem

I expected a result like:

//abc.com/xyz
//foo.com/bar

But I get nothing. I don't know why their test case works but mine doesn't. Is there an error, or did I miss something in my script?

Blurie
  • In their example they are searching for `text` content of their tags, yours are defined as `href` attributes. Try substituting `text` with `href` (i.e. `soup(href=re.compile(r"//"))`). – zwer Jul 06 '17 at 10:49
  • @zwer thanks a lot :D – Blurie Jul 07 '17 at 03:49

1 Answer

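You are filtering on the wrong attribute. `text` matches a tag's text content, but your "//" values are in `href` attributes, so filter on `href` instead: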

soup = BeautifulSoup(html_text, 'lxml')

for elem in soup(href=re.compile(r'//')):
    print elem.get('href')
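
To make the difference between the two filters concrete, here is a minimal sketch. It assumes Python 3 and the `bs4` package with the built-in `html.parser` (not the `lxml` parser used above), and it adds a `<title>` element that is not in the question, purely so that the text filter has something to match.

```python
import re
from bs4 import BeautifulSoup

# The question's markup plus a hypothetical <title>, added for illustration only.
html_text = """
<html>
    <head>
        <link rel="stylesheet" href="//abc.com/xyz" />
        <meta rel="stylesheet" href="//foo.com/bar" />
        <title>see //example.org</title>
    </head>
</html>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r"//")

# string= (spelled text= in older releases) searches text nodes only.
text_hits = [str(s) for s in soup.find_all(string=pattern)]

# href= searches the value of each tag's href attribute.
href_hits = [tag["href"] for tag in soup.find_all(href=pattern)]

print(text_hits)
print(href_hits)
```

In this sketch the text filter only sees the title string, while the two URLs are only reachable through the attribute filter, which is why the original loop printed nothing.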

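For the follow-up question in the comments (the attribute is not always `href`), write your own filter function and pass it to `find_all()`. The filter only tells you which tags contain the data; you still need to extract the values from the matched tags afterwards.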

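def has_requires_chars(tag):
    # Flatten the attribute values: multi-valued attributes such as
    # class or rel are parsed as lists, all others as plain strings.
    value_list = []
    for avalue in tag.attrs.values():
        if isinstance(avalue, list):
            value_list.extend(avalue)
        else:
            value_list.append(avalue)
    # Match the tag if any attribute value contains "//".
    return any("//" in value for value in value_list)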

soup = BeautifulSoup(html_text, 'lxml')
for elem in soup.find_all(has_requires_chars):
    print elem
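
The extraction step mentioned above could look roughly like the following sketch. It assumes Python 3 and `bs4` with the built-in `html.parser`; the helper `matching_attr_values` and the sample markup (including the `<script src=...>` tag) are hypothetical and not part of the question.

```python
from bs4 import BeautifulSoup


def matching_attr_values(tag, needle="//"):
    """Return (attribute name, value) pairs of `tag` whose value contains `needle`."""
    hits = []
    for name, value in tag.attrs.items():
        # Multi-valued attributes (class, rel, ...) are lists; wrap the rest.
        values = value if isinstance(value, list) else [value]
        hits.extend((name, v) for v in values if needle in v)
    return hits


# Hypothetical sample markup with "//" in two different attributes.
html_text = """
<html><head>
<link rel="stylesheet" href="//abc.com/xyz" />
<script src="//foo.com/bar.js"></script>
<meta name="description" content="no match here" />
</head></html>
"""

soup = BeautifulSoup(html_text, "html.parser")

results = []
# The filter function keeps only tags with at least one matching attribute.
for tag in soup.find_all(lambda t: bool(matching_attr_values(t))):
    results.extend(matching_attr_values(tag))

for name, value in results:
    print(name, value)
```

Returning the attribute name along with the value keeps it clear where each match came from, which helps when the same page mixes `href`, `src`, and other attributes.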
M. Leung
  • Thanks a lot, it works. But what if it's not always the "href" attribute? How can we capture that as well? – Blurie Jul 06 '17 at 10:51
  • You can write your own function and pass it to `find_all()` if there is no suitable built-in filter for your case. – M. Leung Jul 06 '17 at 11:19