count the number of images on a webpage, using urllib

Question

For a class, I have an exercise where I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regexp to try to locate them. But I keep getting a count of one, which I know is wrong. What is wrong with my code?

import sys
import urllib.request
import re
img_pat = re.compile('<img.*>',re.I)

def get_img_cnt(url):
  try:
      w =  urllib.request.urlopen(url)
  except IOError:
      sys.stderr.write("Couldn't connect to %s " % url)
      sys.exit(1)
  contents =  str(w.read())
  img_num = len(img_pat.findall(contents))
  return (img_num)

print (get_img_cnt('http://www.americascup.com/en/schedules/races'))

score 10 · Answer 1 · edited May 23 '17 at 10:27

Don't use regex for parsing HTML; use an HTML parser such as lxml or BeautifulSoup. Here's a working example of how to get the img tag count using BeautifulSoup and requests:

from bs4 import BeautifulSoup
import requests


def get_img_cnt(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')  # name the parser explicitly

    return len(soup.find_all('img'))


print(get_img_cnt('http://www.americascup.com/en/schedules/races'))

Here's a working example using lxml and requests:

from lxml import etree
import requests


def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)

    return int(root.xpath('count(//img)'))


print(get_img_cnt('http://www.americascup.com/en/schedules/races'))

Both snippets print 106.

Hope that helps.

score 2 · Accepted Answer · answered Aug 18 '13 at 20:02

Ahhh regular expressions.

Your regex pattern <img.*> says "Find me something that starts with <img and stuff, and make sure it ends with >."

Regular expressions are greedy by default, though; it'll fill that .* with literally everything it can while leaving a single > character somewhere afterwards to satisfy the pattern. In this case, it would go all the way to the end, </html>, and say "look! I found a > right there!"

You should come up with the right count by making .* non-greedy, like this:

<img.*?>
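
To see the difference concretely, here is a small sketch that runs both patterns over a made-up one-line HTML string (the sample markup is hypothetical, not taken from the page in the question):

```python
import re

# Hypothetical sample markup, chosen only to illustrate the two patterns.
html = '<html><body><img src="a.png"><p>text</p><IMG src="b.png"></body></html>'

greedy = re.compile('<img.*>', re.I)   # the pattern from the question
lazy = re.compile('<img.*?>', re.I)    # the non-greedy variant

greedy_matches = greedy.findall(html)
lazy_matches = lazy.findall(html)

# The greedy pattern swallows everything from the first <img up to the
# last > in the string, so it reports a single match.
print(len(greedy_matches))
# The non-greedy pattern stops at the first > after each <img.
print(len(lazy_matches))
print(lazy_matches)
```

The greedy version yields one long match ending at the final >, while the non-greedy version yields one match per tag.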

Ali Alkhatib

thanks that does work. I don't understand what the ? is doing? – kflaw Aug 18 '13 at 20:05
It tells the regex to stop the search at the first `>` it encounters, not the last. So it will catch every `<img>` and not just a big `<img>` (which could contain other – Maxime Lorant Aug 18 '13 at 20:07
The `?` tells the regular expression to match the arbitrary `.*` pattern with as _few_ characters as possible, rather than as _many_ (which is the default). So if we personify regex a bit longer, it would see `>` as soon as it possibly could to end that match. – Ali Alkhatib Aug 18 '13 at 20:08

Colonel Panic · Answer 3 · 2013-08-18T20:06:01.743

Your regular expression is greedy, so it matches much more than you want. I suggest using an HTML parser.
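
If you would rather not install anything, a parser-based count can be sketched with the standard library's html.parser. The names ImgCounter and count_img_tags are hypothetical helpers, and the sample markup is made up for illustration:

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Counts <img> start tags (hypothetical helper, not from the question)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <IMG> is counted too.
        # Self-closing tags such as <img/> also reach this method by default.
        if tag == 'img':
            self.count += 1


def count_img_tags(html):
    parser = ImgCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up sample: three real tags plus one inside an HTML comment.
sample = ('<p><img src="a.png"><IMG src="b.png"><img src="c.png"/>'
          '<!-- <img src="hidden.png"> --></p>')
print(count_img_tags(sample))
```

Unlike a regex, the parser skips the tag inside the HTML comment, which is one reason a parser tends to be more reliable on real pages.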

img_pat = re.compile('<img.*?>',re.I) will do the trick if you must do it the regex way. The ? makes it non-greedy.
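
Putting that together with urllib, here is one possible sketch of the original function. It makes a few assumptions that are not in the question: the response is decoded to text instead of passed through str() (which returns the bytes repr, with newlines escaped, so the whole page becomes one line), re.S is added so a tag that spans several lines still matches, and count_img_tags is a hypothetical helper split out so the counting can be tried without a network connection:

```python
import re
import sys
import urllib.error
import urllib.request

# Non-greedy pattern; re.S lets . cross newlines inside a multi-line tag.
img_pat = re.compile('<img.*?>', re.I | re.S)


def count_img_tags(html):
    # Hypothetical helper: counts matches in already-decoded text.
    return len(img_pat.findall(html))


def get_img_cnt(url):
    try:
        with urllib.request.urlopen(url) as w:
            # Assumption: fall back to UTF-8 when no charset is declared.
            charset = w.headers.get_content_charset() or 'utf-8'
            html = w.read().decode(charset, errors='replace')
    except urllib.error.URLError:
        sys.stderr.write("Couldn't connect to %s\n" % url)
        sys.exit(1)
    return count_img_tags(html)


# Offline check on made-up markup; call get_img_cnt(url) for a live page.
sample = '<div><img src="a.png">\n<img\n  src="b.png"></div>'
print(count_img_tags(sample))
```

This still counts anything that merely looks like an img tag, including tags inside comments or scripts, so treat it as an approximation.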

A good website for checking what your regex matches on the fly: http://www.pyregex.com/
Learn more about regexes: http://docs.python.org/2/library/re.html

edited Aug 18 '13 at 20:06

answered Aug 18 '13 at 19:58

Colonel Panic

