2

From an HTML/RSS snippet like this:

[...]<div class="..." style="..."></div><p><a href="...">
<img alt="" height="" src="http://link.to/image.jpg"
width="" /></a><span style="">[...]

I want to get the image src link "http://link.to/image.jpg". How can I do this in Python? Thanks.

SandyBr
  • 1
    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Ignacio Vazquez-Abrams May 08 '11 at 11:05
  • Is it HTML or RSS? That's an important distinction. And the correct answer is to use a proper parser; I'm sure Python has those. – svick May 08 '11 at 11:18
  • ok for RSS I should use a parser, but what if it's html? – SandyBr May 08 '11 at 11:21
  • 2
    If it is RSS, you should use an RSS parser (possibly followed by an HTML parser once you extract the HTML). For HTML, you should use an HTML parser. – Quentin May 08 '11 at 11:31

5 Answers

6

lxml is the tool for the job.

To scrape all the images from a webpage would be as simple as this:

import lxml.html

tree = lxml.html.parse("http://example.com")
images = tree.xpath("//img/@src")

print(images)

Giving:

['/_img/iana-logo-pageheader.png', '/_img/icann-logo-micro.png']
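
The paths in this output are relative to the site root. A minimal sketch of turning them into absolute URLs with the standard library's urljoin, assuming the page was fetched from http://example.com (the name base_url is only illustrative):

```python
from urllib.parse import urljoin

# Assumption: the address of the page the paths were scraped from.
base_url = "http://example.com"
images = ['/_img/iana-logo-pageheader.png', '/_img/icann-logo-micro.png']

# Resolve each relative path against the page address.
absolute = [urljoin(base_url, src) for src in images]
print(absolute)
```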

If it were an RSS feed, you'd want to parse it with lxml.etree.
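
For the RSS case, the two-step idea (parse the feed as XML, then parse the HTML it carries) might look roughly like the sketch below. It uses the standard library's xml.etree.ElementTree and html.parser as stand-ins for lxml, and the feed itself is made up for illustration; a real feed may keep its HTML in a different element.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)


# Hypothetical feed: the HTML of the item is escaped inside <description>.
rss = """<rss version="2.0"><channel><title>Example</title>
<item><title>Post</title>
<description>&lt;p&gt;&lt;a href="http://link.to/page"&gt;&lt;img alt="" src="http://link.to/image.jpg" /&gt;&lt;/a&gt;&lt;/p&gt;</description>
</item></channel></rss>"""

root = ET.fromstring(rss)
collector = ImgSrcCollector()
for description in root.iter("description"):
    # ElementTree has already unescaped the text, so this is plain HTML.
    collector.feed(description.text or "")

print(collector.urls)
```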

Acorn
2

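Using urllib and BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup  # BeautifulSoup 4 (the bs4 package)

url = "http://example.com"  # placeholder: the page to scrape

# Fetch the page; the with-block closes the connection afterwards.
with urlopen(url) as f:
    page = f.read()

soup = BeautifulSoup(page, "html.parser")
for img in soup.find_all('img'):
    # The question asks for the src attribute of each image.
    print("IMAGE LINKS:", img.get('src'))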
Guillaume
0

Get the HTML tag data with HTMLParser, adapted from the Tornado spider example:

from html.parser import HTMLParser  # Python 3 name of the HTMLParser module

def get_links(html):
    class URLSeeker(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.urls = []

        def handle_starttag(self, tag, attrs):
            if tag == 'img':
                src = dict(attrs).get('src')
                if src:
                    self.urls.append(src)

    url_seeker = URLSeeker()
    url_seeker.feed(html)
    return url_seeker.urls
hustljian
0

Perhaps you should start by reading the Regex HOWTO tutorial and the Stack Overflow FAQ that says not to use regex whenever you are dealing with XML (HTML), but to use a good parser instead. In your case, BeautifulSoup is one.

With a regex, you would get the link to your image like this:

import re
# [^"]* keeps the match inside a single quoted attribute value.
pattern = re.compile(r'src="(http://[^"]*\.jpg)"')
pattern.search("yourhtmlcontainingtheimagelink").group(1)
Senthil Kumaran
  • 1
    Pre-emptive dissuasion from using regex, I like it :) – Acorn May 08 '11 at 11:36
  • What if the image is a png: I would use pattern = re.compile(r'src="(.*?)"') – SandyBr May 08 '11 at 11:42
  • Instead of `jpg` you would use `png`. If you do the above, it would give all the src links (.html etc) and not just images. – Senthil Kumaran May 08 '11 at 11:45
  • 2
    @SandyBr - This is why you do not use Regex for parsing HTML. It is a solved problem. Be lazy and use `lxml` (for which I gave the simple snippet that does what you want in my answer below) or BeautifulSoup. [`Read this`](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html) if you want to know about the horror that is using Regex for parsing HTML. – Acorn May 08 '11 at 12:10
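
The trade-off discussed in these comments can be seen on a small made-up input. The sketch below runs a .jpg-only pattern and the permissive pattern from the comment against a hypothetical fragment containing a script, a PNG and a JPG: the first misses the PNG, the second also returns the script.

```python
import re

# Hypothetical page fragment with a script, a PNG and a JPG.
html = (
    '<script src="http://link.to/script.js"></script>\n'
    '<img alt="" src="http://link.to/image.png" />\n'
    '<img alt="" src="http://link.to/photo.jpg" />'
)

# Only matches sources ending in .jpg, so the PNG is missed.
jpg_only = re.findall(r'src="(http://[^"]*\.jpg)"', html)

# Matches every src attribute, including the one on <script>.
any_src = re.findall(r'src="(.*?)"', html)

print(jpg_only)
print(any_src)
```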
0

To add to svick's answer, try using the BeautifulSoup parser; it worked for me in the past.

Mihai Oprea