-2

I have a regular expression that I found on the Internet for matching URL links in <a> tags. It looks like this:

variable = re.compile('<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>')

Could anyone please explain how exactly this pattern matches the contents of an <a> tag?

I have a basic understanding of regular expressions from Unix, but this one looks too complicated for me, and I would appreciate an explanation.

Martijn Pieters
  • Did you check the [Python regular expression](http://docs.python.org/2/library/re.html) documentation? Was there anything in it specifically you didn't understand? – Martijn Pieters Jan 10 '13 at 09:01
  • Write up a number of href-tags and try removing specific parts of the regex to see how it changes what it matches and what doesn't match. – dutt Jan 10 '13 at 09:03
  • and the usual must: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – root Jan 10 '13 at 09:04

2 Answers

3

'<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>'

Let's break it down.

  • <a is just that: the literal start of the tag.
  • \s matches a single whitespace character.
  • (?:.*?\s)*? is a non-capturing group repeated zero or more times, as few times as possible (the trailing ? makes the * lazy). The group contains .*? followed by \s: any characters, then a whitespace character. It skips over attributes that come before href.
  • href= is just that: a literal part of the tag.
  • [\'"] matches either ' or ".
  • (.*?) is your capturing group; it lazily captures any characters up to the next quote.
  • [\'"] matches either ' or ".
  • .*? matches anything, or nothing, as few characters as possible.
  • > is just that: the end of the tag.
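As a rough sketch of the same breakdown in code, the pattern can be written with `re.VERBOSE` so that each piece carries its own comment. The name `LINK_RE` and the sample tag are made up for illustration; the pattern itself is the one from the question.

```python
import re

# The question's pattern, spread over several lines with re.VERBOSE.
# LINK_RE is an illustrative name, not something from the original post.
LINK_RE = re.compile(r"""
    <a\s              # literal "<a" followed by one whitespace character
    (?:.*?\s)*?       # lazily skip earlier attributes, each ending in whitespace
    href=             # literal "href="
    ['"]              # opening quote, single or double
    (.*?)             # group 1: the URL, as few characters as possible
    ['"]              # closing quote
    .*?               # anything else inside the tag, lazily
    >                 # end of the opening tag
""", re.VERBOSE)

# Hypothetical sample tag with attributes before and after href.
sample = '<a class="nav" href="http://example.com/page" target="_blank">'
match = LINK_RE.search(sample)
print(match.group(1))
```

Only group 1 is captured; the attributes on either side of `href` are matched but discarded.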

What does this mean in English?

<a ANYTHING href=URL>

ANYTHING is ignored, and URL is captured.
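A small usage sketch of that summary, assuming a made-up HTML snippet: `findall` returns only the captured URL for each tag, and the surrounding attributes are dropped.

```python
import re

# The pattern from the question, written as a raw string.
pattern = re.compile(r'<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>')

# Hypothetical input: one plain link, one with extra attributes and single quotes.
html = """<p><a href="http://example.com/">home</a></p>
<p><a class='ext' href='http://example.org/docs' title="Docs">docs</a></p>"""

# With a single capturing group, findall returns a list of the captured strings.
urls = pattern.findall(html)
print(urls)
```

The `class` and `title` attributes play the role of ANYTHING here and never show up in the result.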

Small details:

  • The URL is surrounded by quotation characters, either ' or " (hence their inclusion in the regex).
  • ANYTHING stands for other attributes that might exist on the link.
  • If you understand basic HTML, you know that a link lives in <a> ... </a> tags, where the opening tag may carry attributes: <a ... >.
  • href= is the attribute we want; its value is the link address.
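To see why the lazy `?` after the quantifiers matters, here is a small comparison, assuming a made-up line that holds two links. The `greedy` variant differs from the original only in that the capturing group is `(.*)` instead of `(.*?)`.

```python
import re

# Hypothetical input: two links on the same line.
html = '<a href="one.html">1</a> <a href="two.html">2</a>'

# Original pattern: the capturing group is lazy.
lazy = re.compile(r'<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>')
# Variant for comparison: the capturing group is greedy.
greedy = re.compile(r'<a\s(?:.*?\s)*?href=[\'"](.*)[\'"].*?>')

lazy_urls = lazy.findall(html)      # one entry per tag
greedy_urls = greedy.findall(html)  # a single entry spanning both tags

print(lazy_urls)
print(greedy_urls)
```

The greedy group runs on to the last quote on the line, so the two tags are read as one and the "URL" contains markup from both.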
Inbar Rose
  • It may be worth mentioning the `?` being used to make qualifiers non-greedy. (In this case, it prevents two or more tags being read as one) – DanielB Jan 10 '13 at 09:07
  • @DanielB yes, this code is dynamic enough that if you DID want to capture the attributes you could just remove the `?:` from the non-capturing group to make it capturing. – Inbar Rose Jan 10 '13 at 09:09
  • For example, here's the same pattern with one `?` removed http://regex.utahraptor.info/r/2/ – DanielB Jan 10 '13 at 09:13
  • @DanielB dude, what is that site - how does it work? – Inbar Rose Jan 10 '13 at 09:15
  • I don't want to spam here, but it's a javascript-client with a python-server. There's a contact email on the site if you want more details. – DanielB Jan 10 '13 at 09:19
0

Well, @Inbar Rose has already answered your question in detail, but some links may cause problems when you use the regular expression to extract them. In that case you can get them with the ordinary split functions, relying on general HTML syntax:

a = '<a href="http://www.google.com">'
# Take the text after href=, then the part between the first pair of double quotes.
print(a.split('href=')[1].split('"')[1])

>> http://www.google.com
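The split approach assumes double quotes and a single `href=` in the string. As a hedged alternative sketch, in the spirit of the comment linking to the well-known question about parsing HTML with regex, the standard library's `html.parser` can collect `href` values without relying on the quoting style. `LinkCollector` and `extract_links` are illustrative names, not part of either answer.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return all href values found in <a> tags of the given HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


print(extract_links('<a href="http://www.google.com">'))
```

This also handles single-quoted and unquoted attribute values, which the split on `"` would miss.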
minocha