How can I specify required text and have that text show up in regex matches?

Question

I've been experimenting with making a simple Python web crawler, and I'm using regular expressions to find the relevant links. The site I am experimenting with is a wiki, and I want to find only the links whose URLs start with /wiki/. I may expand this to some other parts of the site as well, and so I require my code to be as dynamic as possible.

The regex I'm currently using is

<a\s+href=[\'"]\/wiki\/(.*?)[\'"].*?>

However, the matches it finds do NOT include /wiki/ in them. I was unaware of this property of regular expressions. Ideally, since I may expand this to other parts of the site (e.g. /bio/), I would like the regex to return "/wiki/[rest_of_url]" rather than simply "/[rest_of_url]". The regex

<a\s+href=[\'|"]\/(.*?)[\'"].*?>

works fine (it finds URLs that start with /) because it returns "/wiki/[rest_of_url]", but it does not ensure that /wiki appears in the text.

How can I do this?

Thanks,

Daniel Moniz

The obligatory link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — DSM, Feb 12 '12 at 20:08
I don't see why the first one doesn't work, but you can try simplifying it to <a[^>]+href=['"]?(/wiki/[^'">]+). You should really consider using one of the HTML parsers out there (Beautiful Soup, for example) — Optimist, Feb 12 '12 at 20:10
Can you post your code? Also, are you using `re.match`? Maybe you need to be using `re.search` or `re.findall`. — Joel Cornett, Feb 12 '12 at 20:11
Ah, now I see. If you want the regex to return /wiki/...stuff.., you need to expand your parentheses to include the /wiki/ part. — Joel Cornett, Feb 12 '12 at 20:12

Joel Cornett · Accepted Answer · 2012-02-12T20:35:47.940

Expand the parentheses so that they include the /wiki/ portion of your regex:

    <a\s+href=[\'"](\/wiki\/.*?)[\'"].*?>
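A minimal sketch of the difference, assuming a made-up HTML snippet (`sample_html` and its link targets are illustrative only, not from the asker's site). With a single capturing group, `re.findall` returns just that group's text, so moving `/wiki/` inside the parentheses changes what comes back:

```python
import re

# Hypothetical sample markup for illustration.
sample_html = (
    '<a href="/wiki/Python" title="Python">Python</a> '
    '<a href="/bio/Guido">Guido</a>'
)

# Original pattern: /wiki/ is required but lies outside the group.
outside = re.findall(r'<a\s+href=[\'"]\/wiki\/(.*?)[\'"].*?>', sample_html)

# Expanded group: /wiki/ is still required and is now part of the result.
inside = re.findall(r'<a\s+href=[\'"](\/wiki\/.*?)[\'"].*?>', sample_html)

print(outside)  # ['Python']
print(inside)   # ['/wiki/Python']
```

The `/bio/` link is skipped by both patterns because `/wiki/` must still match, whether or not it is captured.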

Edit

In re, parentheses create capturing groups, which let you break a match up into sections. The entire expression still has to match, but functions such as re.findall return only the portion inside the parentheses. You can also use multiple sets of parentheses:

    <a\s+href=[\'"](\/wiki\/)(.*?)[\'"].*?>

In this case, MatchObject.group() will return the entire matched string. If you call MatchObject.groups(), however, it will return a tuple containing /wiki/ and whatever matched the second set of parentheses. Check out the python.org documentation on regex syntax.
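As a rough illustration of `group()` versus `groups()`, here is a small sketch using the two-group pattern. The `tag` string is a made-up example, and `re.search` is used on the assumption that the link may appear anywhere in the text:

```python
import re

# Hypothetical anchor tag for illustration.
tag = '<a href="/wiki/Regular_expression" class="internal">'

pattern = re.compile(r'<a\s+href=[\'"](\/wiki\/)(.*?)[\'"].*?>')
match = pattern.search(tag)

whole = match.group()    # the full text matched by the pattern
parts = match.groups()   # one entry per pair of parentheses
prefix = match.group(1)  # first group only
rest = match.group(2)    # second group only

print(whole)   # <a href="/wiki/Regular_expression" class="internal">
print(parts)   # ('/wiki/', 'Regular_expression')
```

Note that when a pattern has more than one group, `re.findall` returns a list of tuples like `parts` rather than a list of strings.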

edited Feb 12 '12 at 20:35

answered Feb 12 '12 at 20:18

Joel Cornett


Hi Joel. This does seem to work, but I'm not exactly sure why. Could you please explain? – Paragon Feb 12 '12 at 20:21
Thanks! I did not know this about regular expressions, and tutorials on them are brutal. – Paragon Feb 12 '12 at 20:47
@Paragon: Agreed. Most of the regular expressions tutorials I've seen are not very helpful. I've found Google's [python re tutorial](http://code.google.com/edu/languages/google-python-class/regular-expressions.html) to be halfway decent, however. – Joel Cornett Feb 12 '12 at 20:59
@Paragon: You could read [Mastering Regular Expressions book by Jeffrey Friedl](http://regex.info/) to use regexs efficiently in practice. – jfs Feb 12 '12 at 21:36

score 1 · Answer 2 · answered Feb 12 '12 at 20:37

You could use an HTML parser, e.g. lxml:

    from lxml import html

    # html_string holds the page source as a string
    for element, attribute, link, pos in html.iterlinks(html_string):
        if attribute == 'href' and link.startswith('/wiki'):
            print(link)

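Or using BeautifulSoup:

    import re
    from bs4 import BeautifulSoup  # BeautifulSoup 4 package

    soup = BeautifulSoup(html_string, 'html.parser')
    # href accepts a compiled regex; only links starting with /wiki match
    for a in soup.find_all('a', href=re.compile(r'^/wiki')):
        print(a['href'])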
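If installing a third-party parser is not an option, the same filtering can be sketched with the standard library's `html.parser`. This is only an illustration: `LinkCollector`, the prefix list, and the sample markup are hypothetical names made up here. Passing a tuple to `str.startswith` is one way to cover several site sections (such as /wiki/ and /bio/) at once:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with one of the given prefixes."""

    def __init__(self, prefixes):
        super().__init__()
        self.prefixes = tuple(prefixes)  # str.startswith accepts a tuple
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == 'href' and value and value.startswith(self.prefixes):
                self.links.append(value)


# Hypothetical sample markup for illustration.
html_string = (
    '<a href="/wiki/Python">Python</a>'
    '<a href="/bio/Guido">Guido</a>'
    '<a href="/about">About</a>'
)

collector = LinkCollector(['/wiki/', '/bio/'])
collector.feed(html_string)
print(collector.links)  # ['/wiki/Python', '/bio/Guido']
```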

answered Feb 12 '12 at 20:37

jfs


Thanks, I should definitely be using this. In the meantime, however, the other answer satisfies my immediate needs. – Paragon Feb 12 '12 at 20:46
