Extract the 'src' attribute from an 'img' tag using Beautiful Soup

Question

Consider:

<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>

I want to extract the source (i.e., src) attribute from an image (i.e., img) tag using Beautiful Soup. I use Beautiful Soup 4, and I cannot use a.attrs['src'] to get the src, but I can get href. What should I do?

Why would you expect `a.attrs['src']` to work? There's no `<a>` tag with a `src` attribute in the snippet you've shown. — jwodder, May 15 '17 at 16:44
This is also a completely different question than before, and the headline makes no sense now. — patrick, May 15 '17 at 17:22
@patrick I used a regex to get the `src`. What is the other question? — iDelusion, May 15 '17 at 18:10
@jwodder I noticed that later, but `img.attrs['src']` also failed when I tried it. In the end I used a regex to get what I wanted. — iDelusion, May 15 '17 at 18:11
Possible duplicate of [Python Beautifulsoup img tag parsing](https://stackoverflow.com/questions/10600079/python-beautifulsoup-img-tag-parsing) — Abu Shoeb, Apr 11 '19 at 19:07

score 97 · Answer 1 · edited Nov 06 '22 at 16:01

You can use Beautiful Soup to extract the src attribute of an HTML img tag. In my example, the htmlText contains the img tag itself, but this can be used for a URL too, along with urllib2.

For URLs (Python 2 with the legacy BeautifulSoup 3 package)

from BeautifulSoup import BeautifulSoup as BSHTML
import urllib2
page = urllib2.urlopen('http://www.youtube.com/')
soup = BSHTML(page)
images = soup.findAll('img')
for image in images:
    # Print image source
    print(image['src'])
    # Print alternate text
    print(image['alt'])

For texts with the img tag

from BeautifulSoup import BeautifulSoup as BSHTML
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" />"""
soup = BSHTML(htmlText)
images = soup.findAll('img')
for image in images:
    print(image['src'])

Python 3:

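from bs4 import BeautifulSoup as BSHTML
import urllib.request  # a bare "import urllib" does not load the request submodule

page = urllib.request.urlopen('https://github.com/abushoeb/emotag')
soup = BSHTML(page, 'html.parser')
images = soup.find_all('img')

for image in images:
    # Print image source (None if the tag has no src attribute)
    print(image.get('src'))
    # Print alternate text (None if the tag has no alt attribute)
    print(image.get('alt'))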

Install modules if needed

# Python 3
pip install beautifulsoup4
# urllib.request is part of the Python 3 standard library, so it needs no separate install

How can I extract image title from img tag with id="my_img", only one specific image — Dipanshu Mahla, May 11 '20 at 17:40
Since `ID` is not a default attribute of the `image` tag, you can't get anything like `image['id']`. However, if you print the `image` value you'll see all attributes and values. Perhaps you can then tokenize it and find your desired image with the id you are looking for. — Abu Shoeb, May 11 '20 at 20:03
On some systems, e.g. some versions of [Ubuntu](https://en.wikipedia.org/wiki/Ubuntu_%28operating_system%29), the name of the executable is `pip3` (for Python 3, and as the only (default) option), not `pip`. — Peter Mortensen, Nov 06 '22 at 16:02
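
Regarding the question above about a single image with id="my_img": Beautiful Soup exposes every attribute that is present on a tag, `id` included, so the tag can be looked up directly instead of tokenizing its printed form. A minimal sketch, assuming Beautiful Soup 4 and made-up markup (only the `my_img` id comes from the comment; the other ids, titles, and file names are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; only id="my_img" is taken from the comment.
html = """
<img id="other" title="Other image" src="other.png"/>
<img id="my_img" title="My image" src="mine.png"/>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
img = soup.find("img", id="my_img")

# .get() returns None instead of raising KeyError when an attribute is absent.
title = img.get("title") if img is not None else None
src = img.get("src") if img is not None else None

print(title)  # My image
print(src)    # mine.png
```

Checking for `None` before reading attributes keeps the lookup from failing on pages where the image is missing.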

score 20 · Answer 2 · edited Nov 06 '22 at 15:47

A link doesn't have attribute src. You have to target the actual img tag.

import bs4

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Return the src attribute of the img tag inside the first 'a' tag
soup.a.img['src']
# 'some'

# If there is more than one 'a' tag
for a in soup.find_all('a'):
    if a.img:
        print(a.img['src'])
# prints: some
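
The same idea can also be written with a CSS selector, which may read more naturally when the img tags are nested at varying depths. This is only a sketch under the assumption that Beautiful Soup 4 is installed with its selector support (the `soupsieve` package, which recent `beautifulsoup4` releases pull in); the variable names are illustrative:

```python
from bs4 import BeautifulSoup

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# The link itself carries href but no src, which is why a.attrs['src'] fails.
href = soup.a["href"]
missing_src = soup.a.get("src")

# Select every <img> that has a src attribute and sits inside an <a> in div.someClass.
sources = [img["src"] for img in soup.select("div.someClass a img[src]")]

print(href)         # href
print(missing_src)  # None
print(sources)      # ['some']
```

The `[src]` part of the selector skips img tags without a src attribute, so the list comprehension cannot raise a KeyError.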

score 8 · Answer 3 · edited Nov 06 '22 at 16:07

Here is a solution that will not trigger a KeyError in case the img tag does not have a src attribute:

from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "[insert name of the site]"
html = urlopen(site)
bs = BeautifulSoup(html, 'html.parser')

images = bs.find_all('img')
for img in images:
    if img.has_attr('src'):
        print(img['src'])

answered Sep 18 '20 at 06:10 by blastoise · edited Nov 06 '22 at 16:07 by Peter Mortensen

Re *"KeyError"*: Is an exception thrown? – Peter Mortensen Nov 06 '22 at 16:07
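
To the question in the comment: yes, subscript access on a tag behaves like a dictionary lookup, so a missing attribute raises KeyError, while `.get()` returns None or a supplied default. A small sketch, assuming Beautiful Soup 4 and made-up markup (the file name and alt text are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first img has no src attribute, the second one does.
html = '<img alt="no source here"/><img src="pic.png"/>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("img")

# Subscript access raises KeyError when the attribute is missing.
try:
    first["src"]
    raised = False
except KeyError:
    raised = True

print(raised)                          # True
print(first.get("src"))                # None
print(second.get("src", "fallback"))   # pic.png
```

Either `has_attr('src')` as in the answer or `.get('src')` avoids the exception; which one reads better depends on whether a default value is useful.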

score 5 · Answer 4 · edited Nov 06 '22 at 16:05

You can use Beautiful Soup to extract the src attribute of an HTML img tag. In my example, the htmlText contains the img tag itself, but this can be used for a URL too, along with urllib3.

The solution in Abu Shoeb's answer no longer works with Python 3. This is a corrected implementation:

For URLs

from bs4 import BeautifulSoup as BSHTML
import urllib3

http = urllib3.PoolManager()
url = 'your_url'

response = http.request('GET', url)
soup = BSHTML(response.data, "html.parser")
images = soup.find_all('img')

for image in images:
    # Print image source (None if the attribute is missing)
    print(image.get('src'))
    # Print alternate text (None if the attribute is missing)
    print(image.get('alt'))

For texts with the 'img' tag

from bs4 import BeautifulSoup as BSHTML
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" />"""
soup = BSHTML(htmlText, "html.parser")
images = soup.find_all('img')
for image in images:
    print(image['src'])
