I've been experimenting with making a simple Python web crawler, and I'm using regular expressions to find the relevant links. The site I am experimenting with is a wiki, and I want to find only the links whose URLs start with /wiki/. I may expand this to some other parts of the site as well, and so I require my code to be as dynamic as possible.
The regex I'm currently using is
<a\s+href=[\'"]\/wiki\/(.*?)[\'"].*?>
However, the matches it finds do NOT include /wiki/. I was unaware that only the part inside the parentheses is returned. Ideally, since I may expand this to other parts of the site (e.g. /bio/), I would like the regex to return "/wiki/[rest_of_url]" rather than simply "[rest_of_url]". The regex
<a\s+href=[\'"]\/(.*?)[\'"].*?>
works fine (it finds URLs that start with /) because it returns "/wiki/[rest_of_url]", but it does not ensure that the URL starts with /wiki/.
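To make the behaviour concrete, here is a minimal sketch using re.findall. The sample HTML and the helper name link_pattern are made up for illustration. The second and third patterns simply move the opening parenthesis in front of the prefix so that the prefix falls inside the capture group; I assume that is one possible way to keep /wiki/ in the result, but I am not sure it is the best one.

```python
import re

# Hypothetical sample HTML, made up purely for illustration.
html = (
    '<a href="/wiki/Python">Python</a> '
    "<a href='/bio/Guido' class=\"x\">Guido</a> "
    '<a href="/wiki/Regex" title="Regex">Regex</a>'
)

# My current pattern: the group starts after /wiki/, so the prefix is dropped.
current = re.compile(r'<a\s+href=[\'"]\/wiki\/(.*?)[\'"].*?>')
print(current.findall(html))

# Same pattern with the prefix moved inside the capture group.
with_prefix = re.compile(r'<a\s+href=[\'"](\/wiki\/.*?)[\'"].*?>')
print(with_prefix.findall(html))


def link_pattern(sections):
    """Build a pattern for links under any of the given top-level sections.

    `sections` is a list of names such as ["wiki", "bio"] (hypothetical helper).
    """
    alternatives = "|".join(re.escape(s) for s in sections)
    # (?:...) groups the alternatives without creating an extra capture group.
    return re.compile(r'<a\s+href=[\'"](/(?:%s)/.*?)[\'"].*?>' % alternatives)


print(link_pattern(["wiki", "bio"]).findall(html))
```

With one capture group, findall returns only the text of that group, which would explain why the prefix disappears in the first case.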
How can I do this?
Thanks,
Daniel Moniz