69

Consider:

<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>

I want to extract the source (i.e., src) attribute from an image (i.e., img) tag using Beautiful Soup. I use Beautiful Soup 4, and I cannot use a.attrs['src'] to get the src, but I can get href. What should I do?

Peter Mortensen
iDelusion

4 Answers

97

You can use Beautiful Soup to extract the src attribute of an HTML img tag. In my example, the htmlText contains the img tag itself, but this can be used for a URL too, along with urllib2.

For URLs (Python 2 with the legacy BeautifulSoup 3 package)

from BeautifulSoup import BeautifulSoup as BSHTML
import urllib2
page = urllib2.urlopen('http://www.youtube.com/')
soup = BSHTML(page)
images = soup.findAll('img')
for image in images:
    # Print image source
    print(image['src'])
    # Print alternate text
    print(image['alt'])

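For texts with the img tag (Python 2 with the legacy BeautifulSoup 3 package)

from BeautifulSoup import BeautifulSoup as BSHTML
# Two well-formed img tags; each one is closed before the next begins
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
soup = BSHTML(htmlText)
images = soup.findAll('img')
for image in images:
    # Print image source
    print(image['src'])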

Python 3:

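from bs4 import BeautifulSoup as BSHTML
# "import urllib" alone does not load the request submodule
import urllib.request

page = urllib.request.urlopen('https://github.com/abushoeb/emotag')
# Name the parser explicitly so the result does not depend on what is installed
soup = BSHTML(page, 'html.parser')
images = soup.find_all('img')

for image in images:
    # Print image source (None if the tag has no src attribute)
    print(image.get('src'))
    # Print alternate text (None if the tag has no alt attribute)
    print(image.get('alt'))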

Install modules if needed

# Python 3
pip install beautifulsoup4
# urllib.request is part of the standard library, so it needs no separate install
Peter Mortensen
Abu Shoeb
  • How can I extract image title from img tag with id="my_img", only one specific image – Dipanshu Mahla May 11 '20 at 17:40
  • Since `ID` is not a default attribute of the `image` tag, you can't get anything like `image['id']`. However, if you print the `image` value you'll see all attributes and values. Perhaps you can then tokenize it and find your desired image with the id you are looking for. – Abu Shoeb May 11 '20 at 20:03
  • On some systems, e.g. some versions of [Ubuntu](https://en.wikipedia.org/wiki/Ubuntu_%28operating_system%29), the name of the executable is `pip3` (for Python 3, and as the only (default) option), not `pip`. – Peter Mortensen Nov 06 '22 at 16:02
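
On the question in the comments about picking out one image by its id: Beautiful Soup 4 exposes every attribute that is present in the markup, id included, so a lookup by id is possible. The following is a minimal sketch, not taken from any of the answers. The id "my_img" comes from the comment; the rest of the markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the id "my_img" comes from the comment above.
html = """
<div>
    <img id="other" src="/a.png" title="First"/>
    <img id="my_img" src="/b.png" title="Second"/>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
target = soup.find("img", id="my_img")

if target is not None:
    # .get() returns None instead of raising KeyError for a missing attribute.
    title = target.get("title")
    src = target.get("src")
    print(title, src)
```

Checking for None before reading attributes keeps the snippet from failing when the page has no element with that id.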
20

A link (the a tag) has no src attribute, which is why a.attrs['src'] fails while href works. You have to target the img tag nested inside it.

import bs4

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# This returns the src attribute of the img tag nested inside the first 'a' tag
soup.a.img['src']
# Result: 'some'

# If you have more than one 'a' tag
for a in soup.find_all('a'):
    # Skip links that do not contain an img tag
    if a.img:
        print(a.img['src'])
# Prints: some
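
To make the difference between the two tags concrete, here is a small sketch built on the question's HTML. It assumes Beautiful Soup 4 with its CSS selector support (select), and the names link_src and pairs are illustrative only.

```python
import bs4

html = """<div class="someClass">
    <a href="href">
        <img alt="some" src="some"/>
    </a>
</div>"""

soup = bs4.BeautifulSoup(html, "html.parser")
link = soup.a

# The 'a' tag only carries href, so .get('src') returns None
# (link['src'] would raise KeyError instead).
link_src = link.get("src")

# select() takes a CSS selector; this one matches img tags that have a src
# attribute and sit somewhere inside an 'a' tag.
pairs = [
    (img.find_parent("a").get("href"), img["src"])
    for img in soup.select("a img[src]")
]
print(link_src, pairs)
```

Pairing each image source with the href of its enclosing link can be handy when both values are needed together.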
Peter Mortensen
mx0
8

Here is a solution that will not trigger a KeyError in case the img tag does not have a src attribute:

from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "[insert name of the site]"
html = urlopen(site)
bs = BeautifulSoup(html, 'html.parser')

images = bs.find_all('img')
for img in images:
    if img.has_attr('src'):
        print(img['src'])
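
As a possible variation on the same idea, find_all can do the filtering itself: passing src=True matches only tags that have a src attribute. The sketch below uses made-up inline markup instead of a live site so that it runs offline.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration: the second img has no src attribute.
html = """
<img src="/one.png" alt="one">
<img alt="no source">
<img src="/three.png">
"""

bs = BeautifulSoup(html, "html.parser")

# src=True keeps only img tags that have a src attribute at all.
with_src = bs.find_all("img", src=True)
sources = [img["src"] for img in with_src]

# .get() with a default covers optional attributes such as alt.
alts = [img.get("alt", "") for img in with_src]
print(sources, alts)
```

Both approaches avoid the KeyError; the filter version just moves the check into the query.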
Peter Mortensen
blastoise
5

You can use Beautiful Soup to extract the src attribute of an HTML img tag. In my example, the htmlText contains the img tag itself, but this can be used for a URL too, along with urllib3.

The solution in Abu Shoeb's answer no longer works for me with Python 3. This is an updated implementation:

For URLs

from bs4 import BeautifulSoup as BSHTML
import urllib3

http = urllib3.PoolManager()
url = 'your_url'

response = http.request('GET', url)
soup = BSHTML(response.data, "html.parser")
images = soup.find_all('img')

for image in images:
    # Print image source (None if the tag has no src attribute)
    print(image.get('src'))
    # Print alternate text (None if the tag has no alt attribute)
    print(image.get('alt'))

For texts with the 'img' tag

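from bs4 import BeautifulSoup as BSHTML
# Two well-formed img tags; each one is closed before the next begins
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
# Name the parser explicitly so the result does not depend on what is installed
soup = BSHTML(htmlText, "html.parser")
images = soup.find_all('img')
for image in images:
    # Print image source
    print(image['src'])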
Peter Mortensen
Marco Lampis