I've been experimenting with making a simple Python web crawler, and I'm using regular expressions to find the relevant links. The site I am experimenting with is a wiki, and I want to find only the links whose URLs start with /wiki/. I may expand this to some other parts of the site as well, and so I require my code to be as dynamic as possible.

The regex I'm currently using is

<a\s+href=[\'"]\/wiki\/(.*?)[\'"].*?>

However, the matches it finds do NOT include /wiki/ in them. I was unaware of this property of regular expressions. Ideally, since I may expand this to other parts of the site (e.g. /bio/), I would like the regex to return "/wiki/[rest_of_url]" rather than simply "[rest_of_url]". The regex

<a\s+href=[\'"]\/(.*?)[\'"].*?>

works fine (it finds URLs that start with /) because it returns "/wiki/[rest_of_url]", but it does not ensure that the URL starts with /wiki/.

How can I do this?

Thanks,

Daniel Moniz

Paragon

2 Answers

Expand the parentheses so that they include the /wiki/ portion of your regex

    <a\s+href=[\'"](\/wiki\/.*?)[\'"].*?> 
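
A quick way to see the difference, as a minimal sketch: the `sample_html` string below is made up for illustration, and `re.findall` is assumed to be the call the crawler uses (the question does not say which function it calls).

```python
import re

# Hypothetical sample markup, not taken from the real site.
sample_html = (
    '<a href="/wiki/Python">Python</a> '
    "<a href='/bio/Guido'>Guido</a> "
    '<a href="/wiki/Regex" class="internal">Regex</a>'
)

# The question's pattern: the group covers only the part after /wiki/.
tail_only = re.compile(r'<a\s+href=[\'"]\/wiki\/(.*?)[\'"].*?>')
# The widened group: /wiki/ is now inside the parentheses.
with_prefix = re.compile(r'<a\s+href=[\'"](\/wiki\/.*?)[\'"].*?>')

# With exactly one group in the pattern, findall returns that group's text.
tail_links = tail_only.findall(sample_html)
full_links = with_prefix.findall(sample_html)

print(tail_links)
print(full_links)
```

Both patterns match the same two tags; only the portion handed back changes, and the /bio/ link is skipped either way.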

Edit

In re, parentheses create capturing groups, which split a match into sections. The whole expression still has to match, but functions such as re.findall return only the text captured by the group when the pattern contains one. You can also use multiple sets of parentheses:

    <a\s+href=[\'"](\/wiki\/)(.*?)[\'"].*?> 

In this case, MatchObject.group() returns the entire matched text. MatchObject.groups(), however, returns a tuple containing /wiki/ and whatever the second set of parentheses matched. See the python.org documentation on regular expression syntax for details.
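
To see those return values side by side, here is a small sketch. The helper `build_link_pattern` and the section names passed to it are hypothetical; it is just one possible way to let the same two-group pattern cover other parts of the site such as /bio/.

```python
import re

def build_link_pattern(sections):
    """Compile a pattern whose two groups are (prefix, rest_of_url).

    `sections` is a hypothetical list of site sections, e.g. ["wiki", "bio"].
    """
    alternatives = "|".join(re.escape(s) for s in sections)
    return re.compile(
        r'<a\s+href=[\'"](/(?:' + alternatives + r')/)(.*?)[\'"].*?>'
    )

pattern = build_link_pattern(["wiki", "bio"])
sample_html = '<a href="/wiki/Python" title="Python">Python</a>'  # made-up markup

match = pattern.search(sample_html)
whole = match.group()    # the entire matched tag
parts = match.groups()   # one string per capturing group
url = "".join(parts)     # prefix + rest gives the full relative URL

print(whole)
print(parts)
print(url)
```

The `(?:...)` around the section names is a non-capturing group, so adding more sections does not change how many items `groups()` returns.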

Joel Cornett
  • Hi Joel. This does seem to work, but I'm not exactly sure why. Could you please explain? – Paragon Feb 12 '12 at 20:21
  • Thanks! I did not know this about regular expressions, and tutorials on them are brutal. – Paragon Feb 12 '12 at 20:47
  • @Paragon: Agreed. Most of the regular expressions tutorials I've seen are not very helpful. I've found Google's [python re tutorial](http://code.google.com/edu/languages/google-python-class/regular-expressions.html) to be halfway decent, however. – Joel Cornett Feb 12 '12 at 20:59
  • @Paragon: You could read [Mastering Regular Expressions book by Jeffrey Friedl](http://regex.info/) to use regexs efficiently in practice. – jfs Feb 12 '12 at 21:36

You could use an HTML parser, e.g. lxml:

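    from lxml import html

    # iterlinks yields (element, attribute, link, pos) for every link in the document
    for element, attribute, link, pos in html.iterlinks(html_string):
        # the trailing slash keeps paths such as /wikipedia from matching
        if attribute == 'href' and link.startswith('/wiki/'):
            print(link)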

Or using BeautifulSoup:

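    import re
    from bs4 import BeautifulSoup  # BeautifulSoup 4, package name beautifulsoup4

    soup = BeautifulSoup(html_string, 'html.parser')
    # keep only <a> tags whose href starts with /wiki/
    for a in soup.find_all('a', href=re.compile(r'^/wiki/')):
        print(a['href'])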
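
If neither library is installed, the same idea can be sketched with the standard library's `html.parser`. This is a swapped-in alternative rather than part of the answer above; the class name `PrefixLinkCollector`, its `prefixes` parameter, and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser

class PrefixLinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with one of the given prefixes."""

    def __init__(self, prefixes):
        super().__init__()
        self.prefixes = tuple(prefixes)  # str.startswith accepts a tuple
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value and value.startswith(self.prefixes):
                self.links.append(value)

# Hypothetical sample markup; note that href is not the first attribute.
html_string = (
    '<p><a class="internal" href="/wiki/Python">Python</a> '
    '<a href="/bio/Guido">Guido</a> '
    '<a href="http://example.com/">external</a></p>'
)

collector = PrefixLinkCollector(["/wiki/"])
collector.feed(html_string)
print(collector.links)
```

Unlike the regex, this does not depend on href being the first attribute of the tag, and passing `["/wiki/", "/bio/"]` would extend it to other sections.
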
jfs
  • Thanks, I should definitely be using this. In the meantime, however, the other answer satisfies my immediate needs. – Paragon Feb 12 '12 at 20:46