
I have a regular expression,

    links = re.compile('<a(.+?)href=(?:"|\')?((?:https?://|/)[^\'"]+)(?:"|\')?(.*?)>(.+?)</a>', re.I).findall(data)

to find links in some HTML, but it takes a very long time on certain pages. Any optimization advice?

One page it chokes on is http://freeyourmindonline.net/Blog/

Matt

3 Answers


Is there any reason you aren't using an HTML parser? With something like BeautifulSoup, you can get all the links without an ugly regex like that.
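
For instance, a minimal sketch along those lines, assuming bs4 is installed and that data holds the HTML string from the question:

    from bs4 import BeautifulSoup

    # data is assumed to hold the HTML, e.g. the body of the page in the question.
    soup = BeautifulSoup(data, "html.parser")

    # Collect the href of every anchor tag that actually carries one.
    links = [a["href"] for a in soup.find_all("a", href=True)]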

Daenyth
  • Is it possible to get all of the data that the regex gets? The link, the anchor text, and the bits between "a" and "href" and after "href" until the end of the tag? (See the sketch after these comments.) – Matt May 31 '10 at 18:47
  • @Matt: I find it very difficult to understand what your regex is doing, but the general idea of HTML parsers is that they make it easy to parse HTML. I'm sure that whatever you're trying to do is quite straightforward once you've read the documentation. – Mark Byers May 31 '10 at 18:56
  • Yes, very much so. This appears to be a duplicate of your question, and it is answered here: http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup – Daenyth May 31 '10 at 18:58
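
A minimal sketch of how that could look with BeautifulSoup; the sample HTML here is hypothetical, and in practice data would be the page's HTML:

    from bs4 import BeautifulSoup

    # Hypothetical sample document, used only for illustration.
    data = '<p><a class="ext" href="http://example.com/" rel="nofollow">example</a></p>'

    soup = BeautifulSoup(data, "html.parser")
    for a in soup.find_all("a", href=True):
        href = a["href"]        # the link target
        text = a.get_text()     # the anchor text between the tags
        # Every other attribute on the tag (the bits before and after href).
        extras = {k: v for k, v in a.attrs.items() if k != "href"}
        print(href, text, extras)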

I'd suggest using BeautifulSoup for this task.

Mark Byers

How about more straightforward handling of hrefs?

    re_href = re.compile(r"""<\s*a(?:[^>]+?)href=("[^"]*(\\"[^"]*)*"|'[^']*(\\'[^']*)*'|[^\s>]*)[^>]*>""", re.I)

That takes about 0.007 seconds, compared with your findall, which takes 38.694 seconds on my computer. The speedup most likely comes from avoiding the original's ambiguous (.+?) groups and optional quote groups, which force heavy backtracking whenever a candidate match fails.
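
For what it's worth, here is a rough sketch of how such a comparison could be timed with the standard library; the sample HTML is a hypothetical stand-in, not the page from the question, so it will not reproduce the figures above:

    import re
    import timeit

    # Hypothetical stand-in for the real page's HTML.
    data = '<p><a class="x" href="http://example.com/">example</a> <a href="/about">about</a></p>' * 1000

    # The original regex from the question and the alternative from this answer.
    links_re = re.compile('<a(.+?)href=(?:"|\')?((?:https?://|/)[^\'"]+)(?:"|\')?(.*?)>(.+?)</a>', re.I)
    re_href = re.compile(r"""<\s*a(?:[^>]+?)href=("[^"]*(\\"[^"]*)*"|'[^']*(\\'[^']*)*'|[^\s>]*)[^>]*>""", re.I)

    print("original:", timeit.timeit(lambda: links_re.findall(data), number=1))
    print("re_href: ", timeit.timeit(lambda: re_href.findall(data), number=1))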

ony