I need some feedback on how to count the images in an HTML page with Python 3.01 after extracting them. Maybe my regular expressions are not being used properly.

Here is my code:

import re, os
import urllib.request
def get_image(url):
  url = 'http://www.google.com'
  total = 0
  try:
    f = urllib.request.urlopen(url)
    for line in f.readline():
      line = re.compile('<img.*?src="(.*?)">')
      if total > 0:
        x = line.count(total)
        total += x
        print('Images total:', total)

  except:
    pass
user2537246

2 Answers

Using beautifulsoup4 (an HTML parser) rather than a regex:

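import urllib.request

import bs4  # beautifulsoup4

html = urllib.request.urlopen('http://www.imgur.com/').read()
soup = bs4.BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning
images = soup.find_all('img')  # find_all is the bs4 spelling; findAll is a legacy alias
print(len(images))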
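
If installing beautifulsoup4 is not an option, the same parser-based idea can be sketched with the standard library's html.parser module. This is a swapped-in technique, not part of the answer above, and the names ImgCounter, count_images and sample are made up for illustration. It works on an HTML string, so it can be tried without a network connection:

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Counts <img> tags and collects their src attributes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.count = 0      # number of <img> tags seen so far
        self.sources = []   # src values of the <img> tags that have one

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lower case.
        # Self-closing tags such as <img ... /> also end up here by default.
        if tag == "img":
            self.count += 1
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


def count_images(html_text):
    """Return (number of <img> tags, list of their src values)."""
    parser = ImgCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count, parser.sources


# Assumed sample markup, only for demonstration.
sample = '<p><IMG alt="a" src="a.png"><img src="b.jpg" width="10"/><img></p>'
count, sources = count_images(sample)
print(count, sources)
```

For a real page, the bytes returned by urlopen(url).read() would need to be decoded to a string before being passed to count_images.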
Corey Goldberg

A couple of points about your code:

  1. It's much easier to use a dedicated HTML parsing library to parse your pages (that's the Python way). I personally prefer Beautiful Soup.
  2. You're overwriting your line variable in the loop, so the text read from the page is never searched.
  3. total will always be 0 with your current logic, because it is only incremented inside the if total > 0 branch.
  4. There's no need to compile your RE, as the re module caches recently used patterns.
  5. You're discarding your exception, so you get no clues about what's going wrong in the code!
  6. There could be other attributes in the <img> tags, so your regex is a little basic. Also, use the re.findall() method to catch multiple instances on the same line.

Changing your code around a little, I get:

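import re
from urllib.request import urlopen

def get_image(url):

    total  = 0
    # urlopen() returns bytes, one bytes object per line here
    page   = urlopen(url).readlines()

    for line in page:

        # decode the bytes, then catch every <img ...> tag on this line
        text   = line.decode('utf-8', errors='replace')
        hit    = re.findall(r'<img\b.*?>', text)
        total += len(hit)

    print('{0} Images total: {1}'.format(url, total))

get_image("http://google.com")
get_image("http://flickr.com")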
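
Point 5 in the list above (not discarding the exception) isn't reflected in the function itself, so here is a rough sketch of one way to handle it. The names IMG_TAG, count_img_tags and get_image_count are made up for illustration, the 10-second timeout is an arbitrary assumption, and the set of exceptions caught is not exhaustive. It also reads the whole page at once, so an <img> tag split across several lines is still counted:

```python
import re
import urllib.request

# Compiled once at module level purely for readability (hypothetical name).
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def count_img_tags(html_text):
    """Count <img ...> tags in a string of HTML."""
    return len(IMG_TAG.findall(html_text))


def get_image_count(url):
    """Fetch url and return its <img> count, or None if the fetch fails."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            html_text = response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError) as exc:
        # URLError and HTTPError are OSError subclasses; a malformed URL
        # raises ValueError. Report the problem instead of hiding it.
        print("Could not fetch {0}: {1}".format(url, exc))
        return None
    return count_img_tags(html_text)
```

Calling get_image_count("http://google.com") would then return either a number or None, and a failure prints the reason rather than passing silently.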
msturdy
  • Thank you so much!! This code is way better. I will take note of your comments and try to install Beautiful Soup! – user2537246 Jun 30 '13 at 22:55
  • No problem. Don't forget to accept the answer if it's what you needed! Take note of Corey's answer as well, that's a very good example of how simple these tasks are with Beautiful Soup! – msturdy Jun 30 '13 at 22:58
  • Perfect! Where is the accept button on this site? I can't find it :( – user2537246 Jun 30 '13 at 23:05
  • There should be a tick at the side of the answer, under the up/down arrows and the answer's score – msturdy Jun 30 '13 at 23:07