
I have a regular expression,

    links = re.compile('<a(.+?)href=(?:"|\')?((?:https?://|/)[^\'"]+)(?:"|\')?(.*?)>(.+?)</a>', re.I).findall(data)

to find links in some HTML, but it takes a very long time on certain pages. Any optimization advice?

One page it chokes on is http://freeyourmindonline.net/Blog/

Matt

3 Answers


Is there any reason you aren't using an HTML parser? With something like BeautifulSoup, you can get all the links without an ugly regex like that.
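
For instance, a minimal sketch along those lines, assuming bs4 is installed and that data holds the HTML string from the question:

    from bs4 import BeautifulSoup

    # data is assumed to hold the HTML, e.g. the body of the page in the question.
    soup = BeautifulSoup(data, "html.parser")

    # Collect the href of every anchor tag that actually carries one.
    links = [a["href"] for a in soup.find_all("a", href=True)]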

Daenyth
  • Is it possible to get all of the data that the regex gets? The link, the anchor text, and the bits between "a" and "href" and after "href" until the end of the tag? (See the sketch after these comments.) – Matt May 31 '10 at 18:47
  • @Matt: I find it very difficult to understand what your regex is doing, but the general idea of HTML parsers is that they make it easy to parse HTML. I'm sure that whatever you're trying to do is quite straightforward once you've read the documentation. – Mark Byers May 31 '10 at 18:56
  • Yes, very much so. This appears to be a duplicate of your question, and it is answered here: http://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautiful-soup – Daenyth May 31 '10 at 18:58
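
A minimal sketch of how that could look with BeautifulSoup; the sample HTML here is hypothetical, and in practice data would be the page's HTML:

    from bs4 import BeautifulSoup

    # Hypothetical sample document, used only for illustration.
    data = '<p><a class="ext" href="http://example.com/" rel="nofollow">example</a></p>'

    soup = BeautifulSoup(data, "html.parser")
    for a in soup.find_all("a", href=True):
        href = a["href"]        # the link target
        text = a.get_text()     # the anchor text between the tags
        # Every other attribute on the tag (the bits before and after href).
        extras = {k: v for k, v in a.attrs.items() if k != "href"}
        print(href, text, extras)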

I'd suggest using BeautifulSoup for this task.

Mark Byers

How about more straightforward handling of hrefs?

    re_href = re.compile(r"""<\s*a(?:[^>]+?)href=("[^"]*(\\"[^"]*)*"|'[^']*(\\'[^']*)*'|[^\s>]*)[^>]*>""", re.I)

That takes about 0.007 seconds, compared with your findall, which takes 38.694 seconds on my computer. The speedup most likely comes from avoiding the original's ambiguous (.+?) groups and optional quote groups, which force heavy backtracking whenever a candidate match fails.
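
For what it's worth, here is a rough sketch of how such a comparison could be timed with the standard library; the sample HTML is a hypothetical stand-in, not the page from the question, so it will not reproduce the figures above:

    import re
    import timeit

    # Hypothetical stand-in for the real page's HTML.
    data = '<p><a class="x" href="http://example.com/">example</a> <a href="/about">about</a></p>' * 1000

    # The original regex from the question and the alternative from this answer.
    links_re = re.compile('<a(.+?)href=(?:"|\')?((?:https?://|/)[^\'"]+)(?:"|\')?(.*?)>(.+?)</a>', re.I)
    re_href = re.compile(r"""<\s*a(?:[^>]+?)href=("[^"]*(\\"[^"]*)*"|'[^']*(\\'[^']*)*'|[^\s>]*)[^>]*>""", re.I)

    print("original:", timeit.timeit(lambda: links_re.findall(data), number=1))
    print("re_href: ", timeit.timeit(lambda: re_href.findall(data), number=1))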

ony