Webscraper will not work

Question

I have followed a tutorial pretty much to the letter, and I want my scraper to scrape all the links to the specific pages containing the info about each police station, but it returns the entire site almost.

from urllib import urlopen
import re

f = urlopen("http://www.emergencyassistanceuk.co.uk/list-of-uk-police-stations.html").read()

b = re.compile('<span class="listlink-police"><a href="(.*)">')
a = re.findall(b, f)

listiterator = []
listiterator[:] = range(0,16)

for i in listiterator:
    print a 
    print "\n"

f.close()

http://www.youtube.com/watch?v=Ap_DlSrT-iE I did notice he mentions BeautifulSoup, but I know that my script uses none of its functions. – Damian Stelucir, Apr 09 '12 at 19:31
emergencyassistanceuk.co.uk is going to have no clue why they have so much traffic right now ... ;) – Nix, Apr 09 '12 at 19:37
lol @Nix, so true. On a more practical note, it's a static, unchanging list, so retrieval and regexing are a tad pointless. Just cut and paste the source code into a word processor or Dreamweaver and convert it to CSV. – Skizz, Apr 11 '12 at 01:34

score 7 · Answer 1 · answered Apr 09 '12 at 19:36

Use BeautifulSoup

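from bs4 import BeautifulSoup
from urllib2 import urlopen

f = urlopen("http://www.emergencyassistanceuk.co.uk/list-of-uk-police-stations.html").read()

# Name the parser explicitly so results do not depend on which parsers are installed
bs = BeautifulSoup(f, "html.parser")

# Each matching <span> wraps one <a>; print only its href attribute
for tag in bs.find_all('span', {'class': 'listlink-police'}):
    print tag.a['href']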
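
If installing BeautifulSoup is not an option, roughly the same extraction can be sketched with the standard library's `html.parser` on Python 3. This is a swapped-in technique, not the BeautifulSoup approach above. The class name `PoliceLinkParser`, the `sample` markup, and its URLs are made up for illustration, and the sketch assumes the target spans are not nested inside each other.

```python
from html.parser import HTMLParser


class PoliceLinkParser(HTMLParser):
    """Collect href values of <a> tags inside <span class="listlink-police">."""

    def __init__(self):
        super().__init__()
        self.in_target_span = False  # True while inside a matching <span>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "listlink-police" in classes:
            self.in_target_span = True
        elif tag == "a" and self.in_target_span and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        # Simplification: any closing </span> ends the target region
        if tag == "span":
            self.in_target_span = False


# Hypothetical markup shaped like the page described in the question
sample = """
<span class="listlink-police"><a href="/station-a.html">Station A</a></span>
<span class="other"><a href="/ignore.html">Not a station</a></span>
<span class="listlink-police"><a href="/station-b.html">Station B</a></span>
"""

parser = PoliceLinkParser()
parser.feed(sample)
print(parser.links)
```

To run it against the real page, the downloaded HTML (decoded to `str`) would be passed to `parser.feed` in place of `sample`.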

answered Apr 09 '12 at 19:36

KurzedMetal

"Thanks, did just what I needed." is best expressed ["by clicking on the check box outline to the left of the answer"](http://stackoverflow.com/faq#howtoask). – johnsyweb Apr 21 '12 at 00:53

score 3 · Answer 2 · answered Apr 09 '12 at 19:35

You are using regex to parse HTML. You shouldn't, because you end up with just this type of problem. For a start, the .* wildcard will match as much text as it can. But once you fix that, you will pluck another fruit from the Tree of Frustration. Use a proper HTML parser instead.
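
The greediness can be seen on a small sample. This is a minimal Python 3 sketch; the `html` string and its URLs are invented for illustration and only imitate the markup from the question. The first pattern is the one from the question, the other two are possible narrower variants.

```python
import re

# Two links on one line, imitating the markup in the question (URLs are made up)
html = (
    '<span class="listlink-police"><a href="/station-a.html">A</a></span>'
    '<span class="listlink-police"><a href="/station-b.html">B</a></span>'
)

# Greedy: (.*) runs to the LAST '">' on the line, swallowing the markup in between
greedy = re.findall(r'<span class="listlink-police"><a href="(.*)">', html)

# Non-greedy: (.*?) stops at the first '">' after each href
lazy = re.findall(r'<span class="listlink-police"><a href="(.*?)">', html)

# Negated character class: the capture can never cross a closing quote
strict = re.findall(r'<span class="listlink-police"><a href="([^"]*)">', html)

print(greedy)
print(lazy)
print(strict)
```

Here `greedy` holds a single long string that still contains tags, while `lazy` and `strict` each hold the two href values. Even the narrower patterns stay brittle: a different attribute order, single quotes, or extra whitespace in the page would stop them from matching, which is the reason for preferring a parser.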

score -1 · Answer 3 · edited Apr 09 '12 at 19:36

There are over 1.6k links with that class on the page.

I think it's working correctly... what makes you think it's not working?

And you should definitely use Beautiful Soup; it's stupid simple and extremely usable.

edited Apr 09 '12 at 19:36

Michael

answered Apr 09 '12 at 19:32

Nix

Yeah, but it prints the HTML; I am trying to get it to print everything between the " " on the a tag. I thought this script did just that. – Damian Stelucir Apr 09 '12 at 19:35
You should reword your question from `but it returns the entire site almost` to "my regex is too greedy". – Nix Apr 09 '12 at 19:36
