I am trying to determine the best way to extract an unknown string from a web page, located by its relation to a specific tag, using Python. E.g.

<div class="pictures">
    <img src="http://some.unknownaddress.com/random_image.jpg" alt="" class="image" height="123" width="123">

What I wish to pull out is the image's URL, so I can use it to download the image. The class "pictures" is unique to the page, so I gather I can use that as a reference point to grab the URL, but what I'm not sure of is how to write the code to select whatever URL is in between the " " following that "pictures" class.

I am thinking along the lines of using re, but have no idea how to concoct a pattern that selects that particular string. Should I be using Beautiful Soup to help?

Any help would be much appreciated.

Thanks,

Dog.

user788462

3 Answers

Use lxml and CSS selectors

Python 2.7 (r27:82525, Jul  4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml.html import document_fromstring
>>> doc = """<html>
... <body>
... <div class="pictures">
...     <img src="http://some.unknownaddress.com/random_image1.jpg" alt="" class="image" height="123" width="123">
...     <img src="http://some.unknownaddress.com/random_image2.jpg" alt="" class="image" height="123" width="123">
... </div>
... <div class="pictures">
...     <img src="http://some.unknownaddress.com/random_image3.jpg" alt="" class="image" height="123" width="123">
...     <img src="http://some.unknownaddress.com/random_image4.jpg" alt="" class="image" height="123" width="123">
... </div>
... </body>
... </html>"""
>>> html = document_fromstring(doc)

>>> html.cssselect(".pictures img")
[<Element img at 0x2423f00>, <Element img at 0x242f2d0>, <Element img at 0x242f150>, <Element img at 0x242f210>]

>>> print "\n".join(x.attrib['src'] for x in html.cssselect(".pictures img"))
http://some.unknownaddress.com/random_image1.jpg
http://some.unknownaddress.com/random_image2.jpg
http://some.unknownaddress.com/random_image3.jpg
http://some.unknownaddress.com/random_image4.jpg

Or XPath:

>>> html.xpath("//div[@class='pictures']/img")
[<Element img at 0x2787c60>, <Element img at 0x2787c90>, <Element img at 0x2787cf0>, <Element img at 0x242f210>]

>>> print "\n".join(html.xpath("//div[@class='pictures']/img/@src"))
http://some.unknownaddress.com/random_image1.jpg
http://some.unknownaddress.com/random_image2.jpg
http://some.unknownaddress.com/random_image3.jpg
http://some.unknownaddress.com/random_image4.jpg
Steven Kryskalla
  • Thank you for the very quick reply, and wow that's impressive code. I'm amazed it can be done so simply. I have so much to learn :( – user788462 Jun 08 '11 at 02:12

This is messy but would get the job done. Obviously it'd be better to break this down into functions, etc. to make it smoother. Note that I haven't tested this script specifically, but I have written other scripts of this ilk to do similar things (break down HTML, add stuff in, and paste it back together, for instance). It's a bit tedious, and not pretty, but again, it'll work.

# SourceCode is assumed to already hold the page's HTML as a string.
charCount = -1
imgSources = []
for character in SourceCode:
    charCount += 1
    if character == "<":
        start = charCount
        end = charCount + 4
        testString = SourceCode[start:end]
        if testString == "<img":
            # Scan forward for the ">" that closes this img tag.
            endTag = None
            endCount = -1
            for char in SourceCode[start:]:
                endCount += 1
                if char == ">":
                    endTag = start + endCount
                    break
            if endTag is None:
                continue  # the tag never closes, so skip it
            imgTag = SourceCode[start:endTag]
            # Slide a window along the tag until it matches 'src="'.
            # Only double-quoted src values are handled here.
            testString = 'src="'
            startInImgTag = 0
            excerpt = imgTag[:len(testString)]
            while excerpt != testString and startInImgTag < len(imgTag):
                startInImgTag += 1
                excerpt = imgTag[startInImgTag:startInImgTag + len(testString)]
            if excerpt == testString:
                # The URL runs from just after the opening quotation mark
                # up to the next quotation mark.
                urlStart = startInImgTag + len(testString)
                urlEnd = urlStart
                while urlEnd < len(imgTag) and imgTag[urlEnd] != '"':
                    urlEnd += 1
                imgSources.append(imgTag[urlStart:urlEnd])
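
For comparison, here is a rough sketch of the same idea built on the parser that ships with Python, so nothing needs installing. It assumes Python 3; the class name PictureSrcParser and the sample page are made up for illustration, and it only collects img tags nested inside a div whose class includes "pictures".

```python
from html.parser import HTMLParser


class PictureSrcParser(HTMLParser):
    """Collect the src of every img inside a div whose class includes "pictures"."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # how many divs deep we are inside a "pictures" div
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Count nested divs too, so we know when the outer one closes.
            if self.div_depth or "pictures" in classes:
                self.div_depth += 1
        elif tag == "img" and self.div_depth and attrs.get("src"):
            self.sources.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1


# Hypothetical sample page, modelled on the snippet in the question.
page = """<div class="pictures">
    <img src="http://some.unknownaddress.com/random_image1.jpg" alt="" class="image">
</div>
<div class="other">
    <img src="http://some.unknownaddress.com/ignored.jpg">
</div>"""

parser = PictureSrcParser()
parser.feed(page)
print(parser.sources)
```

The depth counter is there because the parser reports tags one at a time; without it there is no way to tell whether an img belongs to the "pictures" div or to something after it.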
Thomas Thorogood
  • Congratulations, you've just implemented an HTML parser, but with fewer features and less testing than those available from a one-line import :P – detly Jun 08 '11 at 02:15
  • It's true. Except I find it easier to do it myself than figure out how to use someone else's ^_^. I also like cooking from scratch. A trend? – Thomas Thorogood Jun 08 '11 at 02:20
  • I don't dispute that it's worth something to implement an idea yourself to see how it works. But then I'd throw it away and use the tested one :P – detly Jun 08 '11 at 02:28
  • Thanks Tom, nice logical flow and easy to follow what's going on. Perfect for beginners like me :) – user788462 Jun 08 '11 at 02:41

This is very easy to do with BeautifulSoup as well. It's quite similar to the answer using lxml. If you don't name a parser, BeautifulSoup will actually use lxml if it is available, then html5lib, and otherwise falls back to Python's built-in html.parser. Anyways, here is how you do it:

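from bs4 import BeautifulSoup

# html is assumed to hold the page source as a string.
# Naming the parser keeps the result the same from machine to machine.
soup = BeautifulSoup(html, 'html.parser')

# src of every img inside an element with class "pictures"
pictures = [tag.get('src') for tag in soup.select('.pictures img')]

print(*pictures, sep='\n')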
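
The question also asks about downloading the image once you have the URL. Here is a minimal sketch of that step using only the standard library, assuming Python 3. The helper names filename_from_url and download_image are made up for illustration, and the sketch does no error handling and assumes the src values are absolute URLs.

```python
import posixpath
import shutil
import urllib.request
from urllib.parse import urlsplit


def filename_from_url(url, default="image"):
    """Return the last path segment of the URL, or `default` if there is none."""
    name = posixpath.basename(urlsplit(url).path)
    return name or default


def download_image(url, filename=None):
    """Fetch `url` and write its bytes to `filename` (taken from the URL if omitted)."""
    filename = filename or filename_from_url(url)
    with urllib.request.urlopen(url) as response, open(filename, "wb") as out:
        shutil.copyfileobj(response, out)
    return filename
```

With a list of URLs such as the one built above, calling download_image(src) for each src would save every picture into the current directory under its original file name.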
Six