How to find all strings by bs4?

Question

I want to parse a specific page with some images, but images are not in a fixed tag a, here are some examples:

<meta name="description" content="This is Text."><meta name="Keywords" content="Weather"><meta property="og:type" content="article"><meta property="og:title" content="Cloud"><meta property="og:description" content="Testing"><meta property="og:url" content="https://weathernews.jp/s/topics/201807/300285/"><meta property="og:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869"><meta name="twitter:title" content="【天地始粛】音や景色から感じる秋の気配"><meta name="twitter:description" content="28日からは「天地始粛(てんちはじめてさむし)」。 「粛」にはおさまる、弱まる等の意味があり、夏の暑さもようやく落ち着いてくる頃とされています。"><meta name="twitter:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869"><link rel="canonical" href="https://weathernews.jp/s/topics/201807/300285/"><link rel="amphtml" href="https://weathernews.jp/s/topics/201807/300285/amp.html"><script async="async" src="https://www.googletagservices.com/tag/js/gpt.js"></script>
<img style="width:100%" id="box_img1" alt="box1" src="https://smtgvs.weathernews.jp/s/topics/img/dummy.png" class="lazy" data-original="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797">`
<img style="width:100%" id="box_img2" alt="box2" src="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518">

I tried to use code as below to get all images, but no any results, what can I do?

soup.find_all(string=re.compile(r"(https://smtgvs.weathernews.jp/s/topics/img/[0-9]+/.+)\?[0-9]+"))

The subject of this post, and the content are contradicting. What exactly are you asking for? — Joseph Seung Jae Dollar, Aug 28 '18 at 00:46

alecxe · Accepted Answer · 2018-08-28T01:14:06.660

I personally think this is one of the rare cases when applying a regular expression to the complete document without using an HTML parser is the easiest and a good way to go. And, since you are actually just looking for URLs and not matching any HTML tags in the regular expression, points made in this thread are not valid for this case:

In [1]: data = """
   ...: <meta name="twitter:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869">
   ...: <img style="width:100%" id="box_img1" alt="box1" src="https://smtgvs.weathernews.jp/s/topics/img/dummy.png" class="lazy" data-original="https:
   ...: //smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797">`
   ...: <img style="width:100%" id="box_img2" alt="box2" src="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518
   ...: ">
   ...: """

In [2]: import re

In [3]: pattern = re.compile(r"https://smtgvs.weathernews.jp/s/topics/img/[0-9]+/.+\?[0-9]+")

In [4]: pattern.findall(data)
Out[4]: 
['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']

If you are though interested in how would you apply a regular expression pattern to multiple attributes in BeautifulSoup, it may be something along these lines (not pretty, I know):

In [6]: results = soup.find_all(lambda tag: any(pattern.search(attr) for attr in tag.attrs.values()))

In [7]: [next(attr for attr in tag.attrs.values() if pattern.search(attr)) for tag in results]
Out[7]: 
[u'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
 u'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
 u'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']

Here we are basically iterating over all attributes of all elements and checking for a pattern match. Then, once we get all the matching tags we are iterating over the results and get a value of a matching attribute. I really don't like the fact that we apply the regex check twice - when looking for tags and when checking for a desired attribute of a matched tag.

lxml.html and it's XPath powers allow working with attributes directly, but lxml supports XPath 1.0 which does not have regular expression support. You can do smth like:

In [10]: from lxml.html import fromstring

In [11]: root = fromstring(data)

In [12]: root.xpath('.//@*[contains(., "smtgvs.weathernews.jp") and contains(., "?")]') 
Out[12]: 
['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']

which is not 100% what you did and would probably generate false positives, but you can take it further and add more "substring in a string" checks if needed.

Or, you can grab all the attributes of all elements and filter using the regex you already have:

In [14]: [attr for attr in root.xpath("//@*") if pattern.search(attr)]
Out[14]: 
['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']

Maybe with a caveat that it will match content in comments etc also — Peter Gibson, Aug 28 '18 at 00:56
Thanks for your longer answer. but I am not so good at this topic, can you parse all images from here `url = 'https://weathernews.jp/s/topics/201807/300285/?fm=onebox' resp = requests.get(url) soup = BeautifulSoup(resp.text, 'lxml')` — mikezang, Aug 28 '18 at 02:14
@mikezang the basic main idea of the answer is that you don't need to use beautifulsoup for that task - just do a regular expression search inside the `resp.text`. — alecxe, Aug 28 '18 at 03:52
@alecxe can you help me to parse `https://weathernews.jp/s/topics/201807/300285/?fm=onebox`? — mikezang, Aug 28 '18 at 06:06
I got one is `https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869"> — mikezang, Aug 28 '18 at 06:21

How to find all strings by bs4?

1 Answers1