Understanding this regular expression

Question

I have a regular expression I got from the Internet to match URL links in <a> tags. It appears as below:

variable = re.compile('<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>')

Would anyone please explain to me how exactly this pattern is going to match the contents of an <a> tag?

I have a basic understanding of regular expressions in Unix, but this looks too complicated for me, and I would appreciate anybody explaining it to me.

Did you check the [Python regular expression](http://docs.python.org/2/library/re.html) documentation? Was there anything in it specifically you didn't understand? — Martijn Pieters, Jan 10 '13 at 09:01
Write up a number of href-tags and try removing specific parts of the regex to see how it changes what it matches and what doesn't match. — dutt, Jan 10 '13 at 09:03
and the usual must: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — root, Jan 10 '13 at 09:04

score 3 · Answer 1 · answered Jan 10 '13 at 09:02

'<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>'

Let's break it down.

<a is just that, the start of a tag.
\s means a whitespace.
(?:.*?\s)*? means a non-capturing group, repeated as few times as needed (possibly not at all, because the trailing *? is lazy). The contents of this group are .*?\s: anything, and then a whitespace.
href= is just that, part of the tag.
[\'"] means either ' or "
(.*?) is your capturing group, which captures as few characters as possible up to the next quote character.
[\'"] means either ' or "
.*? anything, or nothing
> just that, the end of the tag.
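
As a rough illustration of the breakdown above, the same pattern can be written with re.VERBOSE so that each piece carries its own comment. This is only a sketch: the name link_pattern and the sample tag are made up for the example, and the pattern is written as a raw string, which the original does not do.

```python
import re

# Same pattern as in the question, spread out with re.VERBOSE.
# `link_pattern` and the sample HTML are assumptions made for this sketch.
link_pattern = re.compile(r"""
    <a\s            # literal "<a" followed by one whitespace character
    (?:.*?\s)*?     # non-capturing group, lazily repeated: other attributes
    href=           # literal "href="
    ['"]            # opening quote, single or double
    (.*?)           # group 1: as few characters as possible (the URL)
    ['"]            # closing quote, single or double
    .*?             # anything else, as little as possible, up to ...
    >               # ... the end of the tag
""", re.VERBOSE)

html = '<a class="nav" href="http://example.com/page">Example</a>'
match = link_pattern.search(html)
print(match.group(1))  # http://example.com/page
```

Note that the pattern does not require the closing quote to be the same kind as the opening one.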

What does this mean in English?

<a ANYTHING href=URL>

ANYTHING is ignored, and URL is captured.

Small details:

the URL is surrounded with quotation characters, either ' or " (hence the inclusion in the regex).
ANYTHING are possible attributes that might exist on the link.
if you understand basic HTML, then you know that any link sits in an opening tag of the form <a ... >, normally followed by the link text and a closing </a>
the href= is the attribute we want - which is the link address.
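
A small sketch of these details in practice, assuming a few made-up sample tags (the names pattern, samples and results are only for illustration): extra attributes before or after href are skipped, either quote style is accepted, and a tag without href= gives no match.

```python
import re

# The pattern from the question, written as a raw string.
pattern = re.compile(r'<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>')

# Hypothetical sample tags, invented for this sketch.
samples = [
    '<a href="http://example.com/">plain</a>',
    "<a id='x' class='y' href='/relative/path' target='_blank'>attrs</a>",
    '<a name="anchor-only">no link</a>',
]

# findall returns the contents of the single capturing group (the URL).
results = [pattern.findall(tag) for tag in samples]
for tag, urls in zip(samples, results):
    print(urls)
# ['http://example.com/']
# ['/relative/path']
# []
```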

Inbar Rose

It may be worth mentioning the `?` being used to make quantifiers non-greedy. (In this case, it prevents two or more tags being read as one) – DanielB Jan 10 '13 at 09:07
@DanielB yes, this code is dynamic enough that if you DID want to capture the attributes you could just remove the `?:` from the non-capturing group to make it capturing. – Inbar Rose Jan 10 '13 at 09:09
For example, here's the same pattern with one `?` removed http://regex.utahraptor.info/r/2/ – DanielB Jan 10 '13 at 09:13
@DanielB dude, what is that site - how does it work? – Inbar Rose Jan 10 '13 at 09:15
I don't want to spam here, but it's a javascript-client with a python-server. There's a contact email on the site if you want more details. – DanielB Jan 10 '13 at 09:19
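
The point about non-greedy quantifiers raised in the comments can be checked with a sketch like the one below. It removes the `?` from the capturing group only; that is one possible choice, and the linked example may have removed a different one. The sample HTML and variable names are assumptions.

```python
import re

# Original (lazy) capture group versus a greedy one with one `?` removed.
lazy_pattern = re.compile(r'<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>')
greedy_pattern = re.compile(r'<a\s(?:.*?\s)*?href=[\'"](.*)[\'"].*?>')

# Two tags on one line, made up for this sketch.
html = '<a href="a.html">A</a> <a href="b.html">B</a>'

lazy = lazy_pattern.findall(html)
greedy = greedy_pattern.findall(html)

print(lazy)    # ['a.html', 'b.html']
print(greedy)  # ['a.html">A</a> <a href="b.html']
```

With the greedy group, the capture runs on to the last quote on the line, so the two tags are read as one.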

minocha · Answer 2 · 2013-01-14T09:59:26.380

0

Well, @Inbar Rose has already answered your question in detail, but there may be some links that cause problems when you use the regular expression to get them. In that case you can get them with the ordinary split functions, taking the general HTML syntax into consideration:

a = '<a href="http://www.google.com">'
print(a.split('href=')[1].split('"')[1])

>> http://www.google.com
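
The split approach above raises an IndexError when the tag has no href= and assumes double quotes. A slightly more defensive sketch is shown below; href_by_split is a hypothetical helper name, not something from the answer, and it still only handles a single, simply written tag.

```python
def href_by_split(tag):
    """Return the href value of a single <a> tag, or None if not found.

    Hypothetical helper for illustration; uses only str.split.
    """
    parts = tag.split('href=')
    if len(parts) < 2:
        return None          # no href= in this tag
    rest = parts[1]
    quote = rest[:1]         # the character right after href=
    if quote not in ('"', "'"):
        return None          # unquoted values are not handled here
    # rest starts with the quote, so element 1 is the quoted value.
    return rest.split(quote)[1]


print(href_by_split('<a href="http://www.google.com">'))  # http://www.google.com
print(href_by_split("<a class='x' href='/about'>"))       # /about
print(href_by_split('<a name="top">'))                    # None
```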

edited Jan 14 '13 at 09:59

answered Jan 10 '13 at 11:52

minocha

you wrote in your split `'href='` but it is not in your example. which is somewhat confusing. – Inbar Rose Jan 13 '13 at 08:07
@InbarRose - I'm sorry that was a mistake.. I've edited it again. – minocha Jan 14 '13 at 09:59
