I personally think this is one of the rare cases when applying a regular expression to the complete document without using an HTML parser is the easiest and a good way to go. And, since you are actually just looking for URLs and not matching any HTML tags in the regular expression, points made in this thread are not valid for this case:
In [1]: data = """
...: <meta name="twitter:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869">
...: <img style="width:100%" id="box_img1" alt="box1" src="https://smtgvs.weathernews.jp/s/topics/img/dummy.png" class="lazy" data-original="https:
...: //smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797">`
...: <img style="width:100%" id="box_img2" alt="box2" src="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518
...: ">
...: """
In [2]: import re
In [3]: pattern = re.compile(r"https://smtgvs.weathernews.jp/s/topics/img/[0-9]+/.+\?[0-9]+")
In [4]: pattern.findall(data)
Out[4]:
['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']
If you are though interested in how would you apply a regular expression pattern to multiple attributes in BeautifulSoup
, it may be something along these lines (not pretty, I know):
In [6]: results = soup.find_all(lambda tag: any(pattern.search(attr) for attr in tag.attrs.values()))
In [7]: [next(attr for attr in tag.attrs.values() if pattern.search(attr)) for tag in results]
Out[7]:
[u'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
u'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
u'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']
Here we are basically iterating over all attributes of all elements and checking for a pattern match. Then, once we get all the matching tags we are iterating over the results and get a value of a matching attribute. I really don't like the fact that we apply the regex check twice - when looking for tags and when checking for a desired attribute of a matched tag.
lxml.html
and it's XPath powers allow working with attributes directly, but lxml
supports XPath 1.0 which does not have regular expression support. You can do smth like:
In [10]: from lxml.html import fromstring
In [11]: root = fromstring(data)
In [12]: root.xpath('.//@*[contains(., "smtgvs.weathernews.jp") and contains(., "?")]')
Out[12]:
['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']
which is not 100% what you did and would probably generate false positives, but you can take it further and add more "substring in a string" checks if needed.
Or, you can grab all the attributes of all elements and filter using the regex you already have:
In [14]: [attr for attr in root.xpath("//@*") if pattern.search(attr)]
Out[14]:
['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']