
This is my code to get the URLs of the images on a web page.

For some web pages it works very well, but for others it doesn't work at all.

This is my code:

#!/usr/bin/python

import urllib2
import re
bufOne = urllib2.urlopen(r"http://vgirl.weibo.com/5show/user.php?fid=17262", timeout=4).read()
bufTwo = urllib2.urlopen(r"http://541626.com/pages/38307", timeout=4).read()

# Keep the two results separate so the second search does not overwrite the first
jpgOne = re.findall(r'http://[\w/]*?jpg', bufOne, re.IGNORECASE)
jpgTwo = re.findall(r'http://[\w/]*?jpg', bufTwo, re.IGNORECASE)
print jpgOne
print jpgTwo

bufOne works well, but bufTwo doesn't. How should I write the pattern so that it works for bufTwo as well?

thlgood

2 Answers


Don't use regex to parse HTML. Rather use Beautiful Soup to find all img tags and then get the src attributes.

from BeautifulSoup import BeautifulSoup

#...

soup = BeautifulSoup(bufTwo)
imgTags = soup.findAll('img')
img = [tag['src'] for tag in imgTags]
ddk
  • Thanks, but how should I read `[tag['src'] for tag in imgTags]`? – thlgood Mar 22 '12 at 13:31
  • It's a list comprehension. `imgTags` is a list of `Tag` objects (look at the BeautifulSoup documentation for more info). The list comprehension makes a new list that contains the values of all the `src` attributes. It is just a quick way of writing a loop: start with `img = []`, then run `for tag in imgTags: img.append(tag['src'])`. – ddk Mar 22 '12 at 13:41
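To make the comment above concrete, here is a minimal sketch that compares the comprehension with the explicit loop. It assumes plain dictionaries as hypothetical stand-ins for `Tag` objects (both support `tag['src']`), and the URLs are made up for illustration.

```python
# Hypothetical stand-ins for Tag objects: plain dicts that support tag['src'].
img_tags = [
    {"src": "http://example.com/a.jpg"},
    {"src": "http://example.com/b.jpg"},
]

# List comprehension form, as used in the answer.
srcs_comprehension = [tag["src"] for tag in img_tags]

# Equivalent explicit loop.
srcs_loop = []
for tag in img_tags:
    srcs_loop.append(tag["src"])

# Both approaches build the same list.
print(srcs_comprehension == srcs_loop)
```

Either form should raise a `KeyError` for an element without a `src` key, so with real pages it may be worth skipping tags that lack the attribute.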

I'll build on ddk's answer to show you an easier way of getting all the images, using Beautiful Soup like this:

import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(bufTwo)  # bufTwo is the HTML fetched in the question
all_imgs = soup.findAll("img", { "src" : re.compile(r'http://[\w/]*?jpg') })

That already gives you a list of all the img tags whose src matches the pattern.
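For readers who want to try the tag-based approach without installing anything, here is a rough sketch that swaps in the standard library's `html.parser` for Beautiful Soup, assuming Python 3. The `ImgSrcCollector` class, the sample markup, and the extension-only pattern are all hypothetical and not taken from either answer.

```python
import re
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


# Made-up sample markup, not taken from either page in the question.
SAMPLE_HTML = """
<html><body>
  <img src="http://example.com/pics/one.jpg">
  <img src="http://img-cdn.example.com/two.JPG" alt="two">
  <img src="/static/logo.png">
  <img alt="no source here">
</body></html>
"""

collector = ImgSrcCollector()
collector.feed(SAMPLE_HTML)

# Filter on the file extension only, so dots and hyphens in the URL are fine.
jpg_pattern = re.compile(r"\.jpe?g$", re.IGNORECASE)
jpg_urls = [src for src in collector.srcs if jpg_pattern.search(src)]

print(jpg_urls)
```

Because the parser finds the tags first, the pattern only has to describe the end of the URL rather than every character a URL may contain.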

Guilherme David da Costa