Regex for links in html text

Question

I hope this question is not an RTFM one. I am trying to write a Python script that extracts links from a standard HTML webpage (the <link href... tags). I have searched the web for matching regexen and found many different patterns. Is there any agreed, standard regex to match links?

Adam

UPDATE: I am actually looking for two different answers:

1. What's the library solution for parsing HTML links? Beautiful Soup seems to be a good solution (thanks, Igal Serban and cletus!)
2. Can a link be defined using a regex?

score 17 · Answer 1 · answered Jan 10 '09 at 13:52

Regexes with HTML get messy. Just use a DOM parser like Beautiful Soup.

cletus

+1: No, HTML cannot be described by regular expressions. It's more complex. And, worse, browsers are allowed to accept invalid HTML, so web sites send invalid HTML. – S.Lott Jan 10 '09 at 14:39
I swear this question comes up enough to warrant a sticky on the faq – annakata Jan 10 '09 at 16:26

score 8 · Accepted Answer · answered Jan 10 '09 at 17:53

As others have suggested, if real-time-like performance isn't necessary, BeautifulSoup is a good solution:

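from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the page and parse it with the standard-library-backed HTML parser.
html = urlopen("http://www.google.com").read()
soup = BeautifulSoup(html, "html.parser")
# Every <a> element; read each URL with link.get("href").
all_links = soup.find_all("a")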

As for the second question, yes, HTML links ought to be well-defined, but the HTML you actually encounter is very unlikely to be standard. The beauty of BeautifulSoup is that it uses browser-like heuristics to try to parse the non-standard, malformed HTML that you are likely to actually come across.

If you are certain to be working on standard XHTML, you can use (much) faster XML parsers like expat.
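
As a rough sketch of that XHTML route, assuming the input really is well-formed, the standard library's expat bindings can collect href values from a start-element callback. The function name xhtml_links and the sample document are made up for illustration.

```python
import xml.parsers.expat


def xhtml_links(xhtml, tags=("a", "link")):
    """Return the href values of the given tags in a well-formed XHTML string."""
    hrefs = []

    def start_element(name, attrs):
        # expat passes the attributes as a dict by default.
        if name in tags and "href" in attrs:
            hrefs.append(attrs["href"])

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xhtml, True)  # True marks this as the final chunk of input
    return hrefs


doc = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><link rel="stylesheet" href="style.css" /></head>
<body><p><a href="http://stackoverflow.com">stackoverflow</a></p>
<!-- <a href="ignored">not a link</a> -->
</body></html>"""

print(xhtml_links(doc))
```

Unlike BeautifulSoup, expat gives up on the first well-formedness error (it raises xml.parsers.expat.ExpatError), so this only suits pages you know to be valid XHTML.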

Regex, for the reasons above (the parser must maintain state, and regex can't do that), will never be a general solution.

score 5 · Answer 3 · answered Jan 10 '09 at 13:53

No, there isn't.

You can consider using Beautiful Soup. You can call it the standard for parsing HTML files.

Igal Serban

score 4 · Answer 4 · answered Jan 10 '09 at 15:10

Shouldn't a link be a well-defined regex?

No, [X]HTML is not in the general case parseable with regex. Consider examples like:

<link title='hello">world' href="x">link</link>
<!-- <link href="x">not a link</link> -->
<![CDATA[ ><link href="x">not a link</link> ]]>
<script>document.write('<link href="x">not a link</link>')</script>

and that's just a few random valid examples; if you have to cope with real-world tag-soup HTML there are a million malformed possibilities.

If you know and can rely on the exact output format of the target page you can get away with regex. Otherwise it is completely the wrong choice for scraping web pages.
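
To make the failure modes concrete, here is a small sketch that runs a deliberately naive pattern and the standard library's html.parser over three of the cases above. The pattern NAIVE_LINK, the class LinkTagFinder, and the file names are hypothetical, chosen only to tell the cases apart; the CDATA case is left out because parsers treat it differently depending on context.

```python
import re
from html.parser import HTMLParser

SAMPLE = """<link title='hello">world' href="real.css">
<!-- <link href="commented.css"> -->
<script>document.write('<link href="scripted.css">')</script>
"""

# Hypothetical naive pattern: assumes no ">" can occur inside a tag.
NAIVE_LINK = re.compile(r"""<link\b[^>]*?href=["'](.*?)["']""", re.I)


class LinkTagFinder(HTMLParser):
    """Collect href values from real <link> start tags only."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            self.hrefs.extend(value for name, value in attrs if name == "href")


finder = LinkTagFinder()
finder.feed(SAMPLE)
finder.close()

regex_hrefs = NAIVE_LINK.findall(SAMPLE)
parser_hrefs = finder.hrefs
print("regex: ", regex_hrefs)
print("parser:", parser_hrefs)
```

On this sample the regex skips the one real link, because the quoted ">" in its title attribute ends the match early, and it reports the two hrefs that sit inside a comment and a script string. The parser tracks which context it is in and reports only the real one.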

All your examples actually ARE parseable by a regex (not to mention that the last one is invalid). An XML SAX parser (which is what the OP needs) is nothing more than a lexer of a language defined by REs. "malformed possibilities" don't change anything about that. — jpalecek, Mar 06 '09 at 20:49

score 3 · Answer 5 · answered Jan 10 '09 at 15:50

Shouldn't a link be a well-defined regex? This is a rather theoretical question,

I second PEZ's answer:

I don't think HTML lends itself to "well defined" regular expressions since it's not a regular language.

As far as I know, any HTML tag may contain any number of nested tags. For example:

<a href="http://stackoverflow.com">stackoverflow</a>
<a href="http://stackoverflow.com"><i>stackoverflow</i></a>
<a href="http://stackoverflow.com"><b><i>stackoverflow</i></b></a>
...

Thus, in principle, to match a tag properly you must be able at least to match strings of the form:

BE
BBEE
BBBEEE
...
BBBBBBBBBBEEEEEEEEEE
...

where B means the beginning of a tag and E means the end. That is, you must be able to match strings formed by any number of B's followed by the same number of E's. To do that, your matcher must be able to "count", and regular expressions (i.e. finite state automata) simply cannot do that (in order to count, an automaton needs at least a stack). Referring to PEZ's answer, HTML is a context-free grammar, not a regular language.
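
The counting argument can be sketched in a few lines. The function below, with the made-up name is_b_n_e_n, accepts exactly the strings made of some B's followed by the same number of E's, and it needs an unbounded counter to do so, which is the piece a finite automaton lacks.

```python
def is_b_n_e_n(s):
    """Return True if s is n B's followed by exactly n E's, for some n >= 1."""
    depth = 0          # number of B's still waiting for a matching E
    seen_end = False   # becomes True once the first E has been read
    for ch in s:
        if ch == "B":
            if seen_end:
                return False  # a B after an E breaks the B...E shape
            depth += 1
        elif ch == "E":
            seen_end = True
            depth -= 1
            if depth < 0:
                return False  # more E's than B's so far
        else:
            return False      # only B and E are allowed
    return seen_end and depth == 0


for candidate in ["BE", "BBEE", "BBBEEE", "BBE", "BEE", "BEBE"]:
    print(candidate, is_b_n_e_n(candidate))
```

A single integer is enough here because there is only one kind of tag; with several tag names that must close in the right order, the counter becomes a stack.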

No, you actually don't need any of that. In HTML, A tags cannot be nested, and what's inside them is beyond what you need to get the links. — jpalecek, Mar 06 '09 at 20:43

score 1 · Answer 6 · answered Jan 10 '09 at 14:19

It depends a bit on how the HTML is produced. If it's somewhat controlled you can get away with:

re.findall(r'''<link\s+.*?href=['"](.*?)['"].*?(?:</link|/)>''', html, re.I)
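
A quick check of what "somewhat controlled" means for this pattern, using two made-up sample strings: it expects the tag to end in "/>" or "</link>", so XHTML-style markup matches while a plain HTML-style tag on its own does not.

```python
import re

# The same pattern as above, compiled once for reuse.
LINK_RE = re.compile(r'''<link\s+.*?href=['"](.*?)['"].*?(?:</link|/)>''', re.I)

xhtml_style = '<link rel="stylesheet" href="style.css" />'
html_style = '<link rel="stylesheet" href="style.css">'

xhtml_hits = LINK_RE.findall(xhtml_style)  # the self-closing tag is matched
html_hits = LINK_RE.findall(html_style)    # no "/>" or "</link>", so no match
print(xhtml_hits)
print(html_hits)
```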

PEZ

score 1 · Answer 7 · answered Jan 10 '09 at 14:24

Answering your two subquestions there.

1. I've sometimes subclassed SGMLParser (included in the core Python distribution) and must say it's straightforward.
2. I don't think HTML lends itself to "well defined" regular expressions since it's not a regular language.
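
The sgmllib module that provides SGMLParser was removed in Python 3. A comparable subclassing approach, sketched here with the standard library's html.parser, could look like the following; the class name LinkCollector and the example.com URLs are placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> and <link> tags as absolute URLs."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in ("a", "link"):
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


collector = LinkCollector("http://example.com/docs/")
collector.feed(
    '<link rel="stylesheet" href="/style.css">'
    '<a HREF="page.html">page</a>'
    '<a name="top">no href here</a>'
)
print(collector.links)
```

The parser calls handle_starttag once per opening tag, so the subclass only has to decide which tags and attributes it cares about.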

PEZ

It might be. Things aren't always cutting edge where I work. =) – PEZ Jan 10 '09 at 15:34
:-) Any recommendations for a proper py3 replacement? – Adam Matan Jan 10 '09 at 19:39
Not really. Maybe this article can provide some leads: http://www.boddie.org.uk/python/HTML.html – PEZ Jan 10 '09 at 20:23

score 0 · Answer 8 · answered Jan 10 '09 at 15:48

In response to question #2 (shouldn't a link be a well defined regular expression) the answer is ... no.

An HTML link structure is recursive, much like parens and braces in programming languages. There must be an equal number of start and end constructs, and the "link" expression can be nested within itself.

To properly match a "link" expression, a regex would be required to count the start and end tags. Regular expressions are equivalent to finite automata. By definition, a finite automaton cannot "count" constructs within a pattern. A grammar is required to describe a recursive data structure such as this. The inability of a regex to "count" is why you see programming languages described with grammars as opposed to regular expressions.

So it is not possible to create a regex that will positively match 100% of all "link" expressions. There are certainly regexes that will match a good deal of "link"s with a high degree of accuracy, but they won't ever be perfect.

I wrote a blog article about this problem recently. Regular Expression Limitations

JaredPar

Both interesting and helpful - thanks. BTW, This problem is solvable by a pushdown stack automaton, which has more computational power than a regular expression - and this can easily be proved using the pumping lemma (http://en.wikipedia.org/wiki/Pumping_lemma) – Adam Matan Jan 10 '09 at 19:45
Not true. The recursive structures in HTML (such as tables in tables and many others) are surely not parseable by REs, but neither LINKs nor As are recursive in HTML, so you just needn't care about the recursive structures to get the links. – jpalecek Mar 06 '09 at 21:15
@jpalecek, you are incorrect. An A tag is most certainly recursive because the content of the A tag can contain another A tag. It might appear weird but it is certainly parsable HTML – JaredPar Mar 06 '09 at 21:17
No, A tag cannot contain A tags. From the HTML 4.01 DTD: "<!ELEMENT A - - (%inline;)* -(A)", the -(A) means there cannot be an A tag nested inside another A tag. XML DTDs cannot express this, but http://www.w3.org/TR/xhtml1/#prohibitions prohibits it. – jpalecek Mar 06 '09 at 21:28
@jpalecek, interesting. I usually approach these questions much more from a "is it parsable" than a "is it legal html" angle because websites tend to be on the side of the former. Even barring that, you can still have an A tag literally inside it by embedding it in a CDATA or literal string. – JaredPar Mar 06 '09 at 21:48
Yes, but this is actually not "parsable", because browsers don't parse it :-) It's a property that makes the language simpler, browser writers make use of it, so why bother. About CDATA and literals - they are all regular languages, so they aren't obstacles for REs. – jpalecek Mar 06 '09 at 22:16
