How can I extract the result string in BeautifulSoup?

Question

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

start_url = 'https://www.example.com'
downloaded_html = requests.get(start_url)
soup = BeautifulSoup(downloaded_html.text, "lxml")
full_header = soup.select('div.reference-image')
full_header

The output of the above code is:

[<div class="reference-image"><img src="Content/image/all/reference/c101.jpg"/></div>,
 <div class="reference-image"><img src="Content/image/all/reference/c102.jpg"/></div>,
 <div class="reference-image"><img src="Content/image/all/reference/c102.jpg"/></div>]

I would like to extract the img src values, as below:

["Content/image/all/reference/c101.jpg",
 "Content/image/all/reference/c102.jpg",
 "Content/image/all/reference/c102.jpg"]

How can I extract it?

https://stackoverflow.com/questions/28212766/extract-string-from-tag-with-beautifulsoup — fernand0, Apr 15 '20 at 16:06
Can you clarify what exactly the issue is? Please see How to Ask and the help center's on-topic page. — AMC, Apr 15 '20 at 17:39
This is a blatant duplicate of [Extracting an attribute value with beautifulsoup](https://stackoverflow.com/questions/2612548/extracting-an-attribute-value-with-beautifulsoup). — AMC, Apr 15 '20 at 17:41
Thank you @JoshuaVarghese, it perfectly does. Also, is there a way to handle this situation with BeautifulSoup's own functionality? — Murat Dikici, Apr 20 '20 at 12:59

Joshua Varghese · Accepted Answer · 2020-04-15T17:44:35.737

2

To get that, just iterate through the result:

img_srcs = []
for i in full_header:
    img_srcs.append(i.find('img')['src'])

This gives:

['Content/image/all/reference/c101.jpg', 'Content/image/all/reference/c102.jpg', 'Content/image/all/reference/c102.jpg']

Here is a one-liner for this:

img_srcs = [i.find('img')['src'] for i in full_header]
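If you would rather let BeautifulSoup do the matching itself, a CSS selector can target the img tags directly instead of selecting the divs first. This is a minimal sketch, assuming the markup looks like the output shown in the question; the html string is a stand-in for downloaded_html.text, and "html.parser" is used only so the example runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Stand-in for downloaded_html.text, copied from the output in the question.
html = """
<div class="reference-image"><img src="Content/image/all/reference/c101.jpg"/></div>
<div class="reference-image"><img src="Content/image/all/reference/c102.jpg"/></div>
<div class="reference-image"><img src="Content/image/all/reference/c102.jpg"/></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Match only <img> tags that sit inside div.reference-image and have a src attribute.
img_srcs = [img["src"] for img in soup.select("div.reference-image img[src]")]
print(img_srcs)
```

The [src] part of the selector skips any img without a src attribute. The loop version, by contrast, raises a TypeError if one of the divs has no img at all, because find('img') returns None in that case.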

edited Apr 15 '20 at 17:44

answered Apr 15 '20 at 17:38

Joshua Varghese

