Best way to pull out an unknown string from a known tag in a web page using Python

Question

I am trying to determine the best way to save an unknown string on a web page that relates to a specific tag, using Python. E.g.

<div class="pictures">
    <img src="http://some.unknownaddress.com/random_image.jpg" alt="" class="image" height="123" width="123">

What I wish to pull out is the image's URL, so that I can use it to download the image. The class "pictures" is unique to the page, so I gather I can use it as a reference point to grab the URL, but I'm not sure how to write the code to select whatever URL sits between the quotation marks following that "pictures" class.

I am thinking along the lines of using re, but have no idea how to concoct a pattern that selects that particular string. Should I be using Beautiful Soup to help?

Any help would be much appreciated.

Thanks,

Dog.

obligatory link to: http://stackoverflow.com/questions/1732348 — Alastair Pitts, Jun 08 '11 at 01:32
Alastair, not obligatory because regexps were never mentioned. — Nick ODell, Jun 08 '11 at 01:47
@Nick ODell - *"I am thinking down the line of using re..."* — detly, Jun 08 '11 at 02:07

score 2 · Answer 1 · answered Jun 08 '11 at 01:53

Use lxml and CSS selectors

Python 2.7 (r27:82525, Jul  4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml.html import document_fromstring
>>> doc = """<html>
... <body>
... <div class="pictures">
...     <img src="http://some.unknownaddress.com/random_image1.jpg" alt="" class="image" height="123" width="123">
...     <img src="http://some.unknownaddress.com/random_image2.jpg" alt="" class="image" height="123" width="123">
... </div>
... <div class="pictures">
...     <img src="http://some.unknownaddress.com/random_image3.jpg" alt="" class="image" height="123" width="123">
...     <img src="http://some.unknownaddress.com/random_image4.jpg" alt="" class="image" height="123" width="123">
... </div>
... </body>
... </html>"""
>>> html = document_fromstring(doc)

>>> html.cssselect(".pictures img")
[<Element img at 0x2423f00>, <Element img at 0x242f2d0>, <Element img at 0x242f150>, <Element img at 0x242f210>]

>>> print "\n".join(x.attrib['src'] for x in html.cssselect(".pictures img"))
http://some.unknownaddress.com/random_image1.jpg
http://some.unknownaddress.com/random_image2.jpg
http://some.unknownaddress.com/random_image3.jpg
http://some.unknownaddress.com/random_image4.jpg

Or XPath:

>>> html.xpath("//div[@class='pictures']/img")
[<Element img at 0x2787c60>, <Element img at 0x2787c90>, <Element img at 0x2787cf0>, <Element img at 0x242f210>]

>>> print "\n".join(html.xpath("//div[@class='pictures']/img/@src"))
http://some.unknownaddress.com/random_image1.jpg
http://some.unknownaddress.com/random_image2.jpg
http://some.unknownaddress.com/random_image3.jpg
http://some.unknownaddress.com/random_image4.jpg
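
The question also asks about downloading the image once its URL is known. A minimal sketch of that step, assuming Python 3 and only the standard library; the helper names `filename_from_url` and `download_image` are made up for illustration and are not part of lxml:

```python
import os
import posixpath
import shutil
import urllib.request
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="image"):
    """Return the last path component of url, or default if the path has none."""
    path = unquote(urlsplit(url).path)
    return posixpath.basename(path) or default


def download_image(url, directory="."):
    """Fetch url and save it under directory; return the local file path."""
    target = os.path.join(directory, filename_from_url(url))
    # Stream the response to disk rather than holding it all in memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling `download_image(src)` for each `src` string extracted above would save the files into the current directory. Error handling (timeouts, HTTP errors, name clashes) is left out of this sketch.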

Thank you for the very quick reply, and wow that's impressive code. I'm amazed it can be done so simply. I have so much to learn :( — user788462, Jun 08 '11 at 02:12

score 1 · Accepted Answer · answered Jun 08 '11 at 02:09

This is messy but would get the job done. Obviously it'd be better to break this down into functions, etc. to make it smoother. Note that I haven't tested this script specifically, but I have written other scripts of this ilk to do similar things (break down HTML, add stuff in, and paste it back together, for instance). It's a bit tedious, and not pretty, but again... it'll work.

# SourceCode is assumed to hold the page's HTML as a string
charCount = -1
srcValues = []
for character in SourceCode:
    charCount += 1
    if character == "<":
        start = charCount
        end = charCount + 4
        testString = SourceCode[start:end]
        if testString == "<img":
            # find the ">" that closes this img tag
            endTag = SourceCode.find(">", start)
            if endTag == -1:
                break  # the tag never closes, so there is nothing more to find
            imgTag = SourceCode[start:endTag + 1]
            # find "src=" inside the tag, then the quotation marks after it
            startInImgTag = imgTag.find("src=")
            if startInImgTag != -1:
                quote = imgTag[startInImgTag + 4]
                if quote in "\"'":
                    valueStart = startInImgTag + 5
                    valueEnd = imgTag.find(quote, valueStart)
                    if valueEnd != -1:
                        # the string between the quotation marks is the URL
                        srcValues.append(imgTag[valueStart:valueEnd])
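
For comparison, the same hand-rolled idea can lean on the parser that ships with Python, which handles the tag and attribute scanning itself. This is only a sketch, assuming Python 3; the class name `PictureSrcParser` and the function `picture_sources` are invented for this example, and it restricts the search to img tags inside a div whose class includes "pictures", as in the question:

```python
from html.parser import HTMLParser


class PictureSrcParser(HTMLParser):
    """Collect src values of <img> tags nested inside <div class="pictures">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <div>s deep we are inside a "pictures" div
        self.sources = []   # src values found so far

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Count nested divs too, so we know when the outer one closes.
            if self.depth or "pictures" in classes:
                self.depth += 1
        elif tag == "img" and self.depth and attrs.get("src"):
            self.sources.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def picture_sources(page_source):
    """Return the src of every img found inside a "pictures" div."""
    parser = PictureSrcParser()
    parser.feed(page_source)
    parser.close()
    return parser.sources
```

Passing the page's HTML string to `picture_sources` returns a list of URL strings, in document order.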

Congratulations, you've just implemented an HTML parser, but with fewer features and less testing than those available from a one-line import :P — detly, Jun 08 '11 at 02:15
It's true. Except I find it easier to do it myself than figure out how to use someone else's ^_^. I also like cooking from scratch. A trend? — Thomas Thorogood, Jun 08 '11 at 02:20
I don't dispute that it's worth something to implement an idea yourself to see how it works. But then I'd throw it away and use the tested one :P — detly, Jun 08 '11 at 02:28
Thanks Tom, nice logical flow and easy to follow what's going on. Perfect for beginners like me :) — user788462, Jun 08 '11 at 02:41

score 0 · Answer 3 · answered Jan 17 '16 at 13:18

This is very easy to do with BeautifulSoup as well. It's quite similar to the answer using lxml. If you don't name a parser, BeautifulSoup will actually use lxml if it is available, then html5lib if that is installed, and otherwise falls back to Python's built-in html.parser. Anyway, here is how you do it:

from bs4 import BeautifulSoup

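# html is assumed to hold the page source as a string; no parser is named,
# so BeautifulSoup picks the best one installed
soup = BeautifulSoup(html)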

pictures = [tag.get('src') for tag in soup.select('.pictures img')]

print(*pictures, sep='\n')
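
Two things to watch before downloading from that list: `tag.get('src')` returns None for an img tag with no src attribute, and src values on real pages are often relative rather than absolute. A small sketch of a clean-up step, using only the standard library; the function name `absolute_sources` and the `page_url` argument (the address the page was fetched from) are assumptions for illustration:

```python
from urllib.parse import urljoin


def absolute_sources(page_url, sources):
    """Resolve each src against page_url, skipping tags that had no src."""
    return [urljoin(page_url, src) for src in sources if src]
```

With the list above this would be called as `absolute_sources(page_url, pictures)`. URLs that are already absolute pass through unchanged.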
