3

For a class, I have an exercise where I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regexp to try to locate them. But I keep getting a count of one, which I know is wrong. What is wrong with my code?

import sys
import urllib.request
import re
img_pat = re.compile('<img.*>',re.I)

def get_img_cnt(url):
  try:
      w =  urllib.request.urlopen(url)
  except IOError:
      sys.stderr.write("Couldn't connect to %s " % url)
      sys.exit(1)
  contents =  str(w.read())
  img_num = len(img_pat.findall(contents))
  return (img_num)

print (get_img_cnt('http://www.americascup.com/en/schedules/races'))
alecxe
kflaw

3 Answers

10

Don't use regex to parse HTML; use an HTML parser such as lxml or BeautifulSoup. Here's a working example that gets the img tag count using BeautifulSoup and requests:

from bs4 import BeautifulSoup
import requests


def get_img_cnt(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    return len(soup.find_all('img'))


print(get_img_cnt('http://www.americascup.com/en/schedules/races'))

Here's a working example using lxml and requests:

from lxml import etree
import requests


def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)

    return int(root.xpath('count(//img)'))


print(get_img_cnt('http://www.americascup.com/en/schedules/races'))

Both snippets print 106.

Hope that helps.

Community
alecxe
2

Ahhh regular expressions.

Your regex pattern <img.*> says "Find me something that starts with <img, followed by any stuff, and make sure it ends with >."

Regex quantifiers are greedy by default, though; .* will swallow literally everything it can while leaving a single > character somewhere afterwards to satisfy the pattern. In this case, it would go all the way to the end, </html>, and say "look! I found a > right there!"

You should come up with the right count by making .* non-greedy, like this:

<img.*?>
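
To see the difference without fetching a page, here is a minimal sketch that runs both patterns against a small inline string. The sample HTML and the names `greedy` and `lazy` are made up for illustration; the string is kept on one line because `str(w.read())` in the question also yields one long line.

```python
import re

# Hypothetical sample page, all on one line (as str(bytes) would produce).
html = '<html><body><img src="a.png"><p>text</p><IMG src="b.png"></body></html>'

greedy = re.compile(r'<img.*>', re.I)   # pattern from the question
lazy = re.compile(r'<img.*?>', re.I)    # non-greedy variant

greedy_matches = greedy.findall(html)
lazy_matches = lazy.findall(html)

# The greedy pattern returns one long match running to the final '>'.
print(len(greedy_matches), greedy_matches)
# The non-greedy pattern stops at the first '>' after each '<img'.
print(len(lazy_matches), lazy_matches)
```

The greedy version finds a single match that starts at the first image and ends at the closing `</html>`, which is why the count comes out as one.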

Ali Alkhatib
  • Thanks, that does work. I don't understand what the ? is doing, though. – kflaw Aug 18 '13 at 20:05
  • It tells the regex to stop the search at the first `>` it encounters, not the last. So it will catch every `<img ...>` and not just one big `<img ...>` (which could contain other ... – Maxime Lorant Aug 18 '13 at 20:07
  • 1
    The `?` tells the regular expression to match the arbitrary `.*` pattern with as _few_ characters as possible, rather than as _many_ (which is the default). So if we personify regex a bit longer, it would grab the first `>` it sees, as soon as it possibly could, to end that match. – Ali Alkhatib Aug 18 '13 at 20:08
1

Your regular expression is greedy, so it matches much more than you want. I suggest using an HTML parser.

img_pat = re.compile('<img.*?>',re.I) will do the trick if you must do it the regex way. The ? makes it non-greedy.
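
For the parser route with no third-party packages, a rough sketch using the standard library's `html.parser` might look like the following. The class name `ImgCounter`, the helper `count_imgs`, and the sample string are all hypothetical; the sample includes a commented-out image, which the parser skips but the non-greedy regex would still count.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Counts <img> start tags as the parser walks the document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <IMG> and <img> both match.
        if tag == 'img':
            self.count += 1


def count_imgs(html):
    parser = ImgCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical sample: two real images and one inside an HTML comment.
sample = '<p><img src="a.png"><!-- <img src="old.png"> --><IMG src="b.png"/></p>'
print(count_imgs(sample))
```

To use it on a live page, the decoded response body (for example `w.read().decode('utf-8', errors='replace')`, assuming the page is UTF-8) could be passed to `count_imgs`.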

Colonel Panic