Counting HTML images with Python

Question

I need some feedback on how to count HTML images with Python 3.01 after extracting them. Maybe my regular expression is not being used properly.

Here is my code:

import re, os
import urllib.request
def get_image(url):
  url = 'http://www.google.com'
  total = 0
  try:
    f = urllib.request.urlopen(url)
    for line in f.readline():
      line = re.compile('<img.*?src="(.*?)">')
      if total > 0:
        x = line.count(total)
        total += x
        print('Images total:', total)

  except:
    pass

Hmm, well, I could use except urllib.error.HTTPError: instead of a bare except:, in case the URL is not found. – user2537246, Jun 30 '13 at 22:36

score 1 · Answer 1 · answered Jun 30 '13 at 22:41


Use beautifulsoup4 (an HTML parser) rather than a regex:

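import urllib.request

import bs4  # beautifulsoup4

# download the page and parse it with the standard library's parser backend
html = urllib.request.urlopen('http://www.imgur.com/').read()
soup = bs4.BeautifulSoup(html, 'html.parser')
# find_all returns every <img> tag in the document
images = soup.find_all('img')
print(len(images))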
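
If installing beautifulsoup4 is not possible, roughly the same count can be sketched with the standard library's html.parser module. This is only a minimal sketch: the ImgCounter class, the count_images function, and the sample string are hypothetical names made up for illustration, not part of the answer above.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Counts <img> start tags seen while parsing (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; self-closing <img ... /> tags
        # are also routed through this method by default.
        if tag == "img":
            self.count += 1


def count_images(html_text):
    # Feed the whole document to the parser and return the tally
    parser = ImgCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count


# Made-up sample markup, used instead of a live page
sample = '<p><img src="a.png"><IMG alt="x" src="b.png"/></p>'
print(count_images(sample))
```

For a live page, the text passed to count_images would be the decoded result of urllib.request.urlopen(url).read().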

answered Jun 30 '13 at 22:41

Corey Goldberg


Hmm, I can't use beautifulsoup4 in my IDLE; I get a traceback error. – user2537246 Jun 30 '13 at 22:44

score 0 · Accepted Answer · answered Jun 30 '13 at 22:43


A couple of points about your code:

It's much easier to use a dedicated HTML parsing library to parse your pages (that's the Python way). I personally prefer Beautiful Soup.
You're overwriting your line variable inside the loop, so the text you just read is replaced by the compiled pattern.
total will always be 0 with your current logic, because it is only updated when it is already greater than 0.
There's no need to compile your RE, as the re module caches recently used patterns.
You're discarding your exception, so you get no clues about what's going wrong in the code!
There could be other attributes in the <img> tags, so your regex is a little basic. Also, use the re.findall() method to catch multiple instances on the same line.
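
To illustrate the last point, here is a small sketch of re.findall() on a single line that holds two <img> tags with extra attributes. The pattern and the sample line are assumptions chosen for illustration; a regex like this still will not cope with every kind of real-world HTML.

```python
import re

# Illustrative pattern (an assumption, not taken from the question):
# allow any other attributes before src and capture the src value.
IMG_SRC_PATTERN = r'<img\b[^>]*?\bsrc="([^"]*)"'

# Made-up sample line with two images and extra attributes
line = '<img class="a" src="one.png" alt=""><img src="two.png" width="10">'

# findall returns one captured src value per match on the line
sources = re.findall(IMG_SRC_PATTERN, line, flags=re.IGNORECASE)
print(len(sources), sources)
```

Because findall returns a list, len() of the result gives the per-line count directly.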

changing your code around a little, I get:

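import re
from urllib.request import urlopen

def get_image(url):

    total  = 0
    page   = urlopen(url).readlines()

    for line in page:

        # each line is bytes; decode it before matching, since str(line) would give "b'...'"
        text  = line.decode('utf-8', errors='replace')
        # findall returns every <img ...> tag on the line, not just the first
        hit   = re.findall('<img.*?>', text)
        total += len(hit)

    print('{0} Images total: {1}'.format(url, total))

get_image("http://google.com")
get_image("http://flickr.com")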
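
On the point about discarded exceptions, one possible way to report failures instead of hiding them is sketched below. The function names count_img_tags and get_image_count, the 10-second timeout, and the UTF-8 decoding are all assumptions made for this sketch. It also searches the whole page at once, so a tag that is split across lines is still counted.

```python
import re
import urllib.error
from urllib.request import urlopen


def count_img_tags(html_text):
    # [^>]* also matches newlines, so tags spanning several lines are found
    return len(re.findall(r'<img\b[^>]*>', html_text, flags=re.IGNORECASE))


def get_image_count(url):
    """Return the number of <img> tags at url, or None if the fetch fails."""
    try:
        with urlopen(url, timeout=10) as response:
            # Assumes UTF-8; undecodable bytes are replaced rather than raising
            html_text = response.read().decode('utf-8', errors='replace')
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404
        print('HTTP error for {0}: {1}'.format(url, err.code))
        return None
    except urllib.error.URLError as err:
        # No response at all (DNS failure, refused connection, ...)
        print('Could not reach {0}: {1}'.format(url, err.reason))
        return None
    return count_img_tags(html_text)
```

HTTPError is caught before URLError because it is a subclass of it. Calling get_image_count("http://google.com") would then return either a count or None, with the reason for a failure printed.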

answered Jun 30 '13 at 22:43

msturdy


Thank you so much!! This code is way better, and I will take note of your comments and try to install Beautiful Soup! – user2537246 Jun 30 '13 at 22:55
No problem. Don't forget to accept the answer if it's what you needed! Take note of Corey's answer as well; that's a very good example of how simple these tasks are with Beautiful Soup! – msturdy Jun 30 '13 at 22:58
Perfect! Where is the accept option located on this site? I can't find it :( – user2537246 Jun 30 '13 at 23:05
There should be a tick at the side of the answer, under the up/down arrows and the answer's score – msturdy Jun 30 '13 at 23:07
