Python to parse a web page's image URLs

Question

I wrote the code below to get the image URLs from a web page.

It works very well for some web pages, but it doesn't work for others.

This is my code:

#!/usr/bin/python

import urllib2
import re
# Uncomment the two bufOne lines to test the first page instead.
#bufOne = urllib2.urlopen(r"http://vgirl.weibo.com/5show/user.php?fid=17262", timeout=4).read()
bufTwo = urllib2.urlopen(r"http://541626.com/pages/38307", timeout=4).read()

#jpgRule = re.findall(r'http://[\w/]*?jpg', bufOne, re.IGNORECASE)
jpgRule = re.findall(r'http://[\w/]*?jpg', bufTwo, re.IGNORECASE)
print jpgRule

bufOne works well, but bufTwo doesn't. How can I write a rule that also works for bufTwo?

score 8 · Accepted Answer · edited May 23 '17 at 11:45

Don't use regex to parse HTML. Instead, use Beautiful Soup to find all img tags and then get their src attributes.

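from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 (Python 2)

#...

soup = BeautifulSoup(bufTwo)  # parse the downloaded HTML
imgTags = soup.findAll('img')  # every <img> tag in the document
img = [tag['src'] for tag in imgTags]  # the src attribute of each tag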
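
For readers on Python 3, where `urllib2` and the Beautiful Soup 3 module are not available, here is a minimal sketch of the same idea (visit every `img` tag and read its `src`) using only the standard library's `html.parser`. It swaps Beautiful Soup for `HTMLParser`, so it is not the code from this answer. The class name `ImgSrcCollector`, the helper `image_urls`, and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:  # skip <img> tags that have no src attribute
            # Resolve relative paths against the page URL.
            self.urls.append(urljoin(self.base_url, src))


def image_urls(html, base_url):
    """Return the absolute URL of every <img src=...> found in html."""
    parser = ImgSrcCollector(base_url)
    parser.feed(html)
    return parser.urls


# Made-up sample markup, for illustration only.
sample_html = """
<html><body>
<img src="http://img.example.com/photos/a-1.jpg">
<img src="/static/b.JPG" alt="relative path">
<img alt="no src here">
</body></html>
"""

print(image_urls(sample_html, "http://example.com/pages/1"))
```

Because the parser reads the attribute value directly, dots, hyphens, and relative paths in the URL need no special handling, and filtering by extension can be done afterwards on the resulting list.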

answered Mar 22 '12 at 10:01

ddk

Thanks, but how should I understand `[tag['src'] for tag in imgTags]`? – thlgood Mar 22 '12 at 13:31

It's a list comprehension. `imgTags` is a list of `Tag` objects (look at the BeautifulSoup documentation for more info). The list comprehension makes a new list that contains the values of all the `src` attributes. It is just a quick way of writing a loop that starts with `img = []` and then calls `img.append(tag['src'])` for each `tag` in `imgTags`. – ddk Mar 22 '12 at 13:41
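
To make the equivalence concrete, here is a small sketch. Plain dicts stand in for the `Tag` objects (an assumption for illustration; both support the `tag['src']` lookup), and the variable names are made up.

```python
# Plain dicts stand in for Beautiful Soup Tag objects here (illustration only).
img_tags = [{"src": "a.jpg"}, {"src": "b.jpg"}]

# Explicit loop: start with an empty list and append one value per tag.
urls_loop = []
for tag in img_tags:
    urls_loop.append(tag["src"])

# List comprehension: the same result in a single expression.
urls_comp = [tag["src"] for tag in img_tags]

print(urls_loop == urls_comp)

# A comprehension can also filter. Here entries without a "src" key are
# skipped, which avoids the KeyError that tag["src"] would raise for them.
mixed_tags = img_tags + [{"alt": "no src"}]
with_src = [tag["src"] for tag in mixed_tags if "src" in tag]
print(with_src)
```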

score 0 · Answer 2 · answered Jul 16 '12 at 23:52

Building on ddk's answer, here is an easier way of getting all the images with Beautiful Soup:

import re
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(bufTwo)  # bufTwo is the page source from the question
all_imgs = soup.findAll("img", { "src" : re.compile(r'http://[\w/]*?jpg') })

That already gives you a list of the img tags whose src matches the pattern.
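
One caveat about the pattern itself, sketched below with made-up URLs: the character class `[\w/]` matches only letters, digits, underscores, and slashes, so a `src` with a dot or hyphen anywhere between `http://` and `jpg` is not matched. The `looser_pattern` here is a hypothetical alternative, not something from the question or the answers.

```python
import re

# The pattern used in the question and in the snippet above.
question_pattern = re.compile(r'http://[\w/]*?jpg')

# Hypothetical looser pattern: an http(s) URL made of non-space,
# non-quote, non-bracket characters that contains .jpg or .jpeg.
looser_pattern = re.compile(r'https?://[^\s"\'<>]+?\.jpe?g', re.IGNORECASE)

# Made-up src values, for illustration only.
srcs = [
    "http://img.example.com/photos/a-1.jpg",
    "http://example.com/static/b.JPG",
    "http://example.com/logo.png",
]

matched_question = [s for s in srcs if question_pattern.search(s)]
matched_looser = [s for s in srcs if looser_pattern.search(s)]

print(matched_question)
print(matched_looser)
```

With these sample values the question's pattern selects nothing, while the looser one keeps the two JPEG URLs and drops the PNG.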
