0

I followed a tutorial pretty much to the letter. I want my scraper to collect all the links to the individual pages containing the info about each police station, but it returns the entire site almost.

from urllib import urlopen
import re

f = urlopen("http://www.emergencyassistanceuk.co.uk/list-of-uk-police-stations.html").read()

b = re.compile('<span class="listlink-police"><a href="(.*)">')
a = re.findall(b, f)

listiterator = []
listiterator[:] = range(0,16)

for i in listiterator:
    print a 
    print "\n"

f.close()
Damian Stelucir
  • 1
    Please cite the tutorial you followed. – Nix Apr 09 '12 at 19:29
  • http://www.youtube.com/watch?v=Ap_DlSrT-iE I did notice he mentions BeautifulSoup, but I know that my script uses none of its functions – Damian Stelucir Apr 09 '12 at 19:31
  • 2
    emergencyassistanceuk.co.uk is going to have no clue why they have so much traffic right now ... ;) – Nix Apr 09 '12 at 19:37
  • lol@Nix.. so true. On a more practical note, it's a static, unchanging list, so retrieval and regexing is a tad pointless. Just cut+paste the source code into a word processor or Dreamweaver and convert to CSV. – Skizz Apr 11 '12 at 01:34

3 Answers

7

Use BeautifulSoup

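from bs4 import BeautifulSoup
from urllib2 import urlopen  # Python 2; on Python 3 this lives in urllib.request

# Download the raw HTML of the listing page
f = urlopen("http://www.emergencyassistanceuk.co.uk/list-of-uk-police-stations.html").read()

# Name the parser explicitly so the result does not depend on which parsers are installed
bs = BeautifulSoup(f, "html.parser")

# Each station link sits inside <span class="listlink-police"><a href="...">
for tag in bs.find_all('span', {'class': 'listlink-police'}):
    print tag.a['href']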
KurzedMetal
  • 3
    "Thanks, did just what I needed." is best expressed ["by clicking on the check box outline to the left of the answer"](http://stackoverflow.com/faq#howtoask). – johnsyweb Apr 21 '12 at 00:53
3

You are using regex to parse HTML. You shouldn't, because you end up with just this type of problem. For a start, the .* wildcard will match as much text as it can. But once you fix that, you will pluck another fruit from the Tree of Frustration. Use a proper HTML parser instead.
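To make the greediness point concrete, here is a minimal sketch. The `html` string is a made-up sample that only imitates the markup implied by the question's pattern (it is not copied from the real page), and the names `greedy`, `lazy` and `strict` are mine. It assumes several links share one line, which would explain why a single match can swallow most of the page.

```python
import re

# Hypothetical sample: two links on the same line, imitating the page's markup.
html = ('<span class="listlink-police"><a href="a.html">A</a></span>'
        '<span class="listlink-police"><a href="b.html">B</a></span>')

# Pattern from the question: (.*) is greedy and runs to the LAST '">' on the line.
greedy = re.compile(r'<span class="listlink-police"><a href="(.*)">')

# Two common fixes: a lazy quantifier, or a character class that cannot cross a quote.
lazy = re.compile(r'<span class="listlink-police"><a href="(.*?)">')
strict = re.compile(r'<span class="listlink-police"><a href="([^"]*)">')

greedy_hits = greedy.findall(html)
lazy_hits = lazy.findall(html)
strict_hits = strict.findall(html)

print(greedy_hits)  # one long match spanning both links
print(lazy_hits)    # one short match per link
print(strict_hits)  # same as the lazy version on this sample
```

Either fix only works while the markup stays exactly in this shape (same attribute order, same quoting), which is why a parser is still the safer choice.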

tripleee
-1

There are over 1.6k links with that class on it.

I think it's working correctly... what makes you think it's not working?

And you should definitely use Beautiful Soup; it's stupid simple and extremely usable.
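If installing Beautiful Soup is not an option, roughly the same extraction can be sketched with the standard library's `html.parser`. Note this is a swapped-in technique, not what the answers here use, and it is written for Python 3 (the question's code is Python 2). The class name `PoliceLinkParser` and the `sample` markup are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class PoliceLinkParser(HTMLParser):
    """Collect href values of <a> tags found inside <span class="listlink-police">."""

    def __init__(self):
        super().__init__()
        self.in_target_span = False  # True while inside a matching <span>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "listlink-police" in classes:
            self.in_target_span = True
        elif tag == "a" and self.in_target_span and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        # Simplification: assumes the target spans are not nested inside each other.
        if tag == "span":
            self.in_target_span = False


# Hypothetical sample; the first link is outside the target span and should be skipped.
sample = (
    '<a href="other.html">Home</a>'
    '<span class="listlink-police"><a href="a.html">A</a></span>'
    '<span class="listlink-police"><a href="b.html">B</a></span>'
)

parser = PoliceLinkParser()
parser.feed(sample)
parser.close()
links = parser.links

for link in links:
    print(link)
```

For the real page you would feed it the decoded download instead of `sample`. Beautiful Soup does the same job in far fewer lines, so this is only worth it when third-party packages are off the table.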

Michael
Nix
  • Yeah, but it prints the HTML. I am trying to get it to print everything between the " " on the a tag. I thought this script does just that. – Damian Stelucir Apr 09 '12 at 19:35
  • You should reword your question from `but it returns the entire site almost` to "my regex is too greedy". – Nix Apr 09 '12 at 19:36