I am trying to parse web pages to find links to particular pages.

For instance, suppose one page contains the following as input:

flowers that never end.')" onmouseout="return nd();" href="/flowers/images/download/01d6ac.html"><img src="http://static.rarbg.com/over/01d6acc21110e68af7476bce50dec3c234343032.jpg" border="0

and another page contains:

flowers that never end')" onmouseout="return nd();" href="/flowers/01d6acc21110e68af7476bce50dec3c234343032.html" src="http://static.rarbg.com/over/01d6acc21110e68af7476bce50dec3c234343032.jpg" border="0

I tried to use the regex below to pick up the link:

'href="/flowers/(.+?)"[^>]'

but it still picks up the link from both inputs, not just the second one! Can anyone help me?

Max
  • 2
    possible duplicate of [Using regular expressions to parse HTML: why not?](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not), or with a longer, detailed answer and 4000+ upvotes: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – phihag Jul 10 '11 at 21:11
  • 2
    Why shouldn't it pick up the second link? – Otto Allmendinger Jul 10 '11 at 21:12
  • @otto-allmendinger second one ends with "> and when it does it means the image in that link has smaller size but when it does not end with "> and href goes on, it is the proper one, it is the style of this specific website. – Max Jul 10 '11 at 21:19
  • @phihag I can't use any tool other than regex :( – Max Jul 10 '11 at 21:20
  • 2
    @Max Why not? There are excellent HTML parsers for python, like the built-in [etree](http://lxml.de/parsing.html), [HTMLParser](http://docs.python.org/library/htmlparser.html) or [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) – phihag Jul 10 '11 at 21:25
  • Exactly. Given that the standard library includes tools for doing this right, why try doing it with regex? – Whatang Jul 10 '11 at 21:30
  • @phihag Because the environment I am working in does not allow it, because it is part of a larger system that relies on a regex pattern being fed to it, and because the nasty person in charge says so :( – Max Jul 10 '11 at 21:31

1 Answer

If for some reason you have to use regex, this expression is a better choice:

'href="/flowers/([^"]+)"[^>]'
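To check the difference, here is a small sketch in Python that runs both patterns over the two inputs quoted in the question. The variable names (first, second, lazy, strict) are only illustrative and not part of the original post. The lazy .+? is allowed to run past the closing quote of the href value until it finds some later quote that is not followed by >, which is why it matches the first input too; [^"]+ cannot cross a quote at all.

```python
import re

# The two fragments from the question, copied as posted.
first = '''flowers that never end.')" onmouseout="return nd();" href="/flowers/images/download/01d6ac.html"><img src="http://static.rarbg.com/over/01d6acc21110e68af7476bce50dec3c234343032.jpg" border="0'''
second = '''flowers that never end')" onmouseout="return nd();" href="/flowers/01d6acc21110e68af7476bce50dec3c234343032.html" src="http://static.rarbg.com/over/01d6acc21110e68af7476bce50dec3c234343032.jpg" border="0'''

# Original pattern: the lazy group may still swallow quotes.
lazy = re.compile(r'href="/flowers/(.+?)"[^>]')
# Suggested pattern: the group stops at the first quote.
strict = re.compile(r'href="/flowers/([^"]+)"[^>]')

print(lazy.findall(first))     # one (unwanted) match that spills past the href value
print(strict.findall(first))   # []
print(strict.findall(second))  # ['01d6acc21110e68af7476bce50dec3c234343032.html']
```
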

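Because [^"]+ cannot run past the closing quote of the href value, the trailing [^>] rejects any link whose closing quote is immediately followed by >. However, your suffering will continue until you use a parser, as you can read in the comments.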
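For comparison, a parser-based approach might look roughly like the sketch below, using html.parser from the Python 3 standard library. The class name FlowerLinkCollector and the sample markup are hypothetical, made up for illustration; the real pages would need their own rule for telling the two kinds of link apart.

```python
from html.parser import HTMLParser


class FlowerLinkCollector(HTMLParser):
    """Collect href values of <a> tags that point under /flowers/."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith("/flowers/"):
            self.links.append(href)


# Hypothetical sample markup, not taken from the real site.
sample = (
    '<a onmouseout="return nd();" href="/flowers/example.html">'
    '<img src="http://example.com/a.jpg" border="0"></a>'
    '<a href="/other/page.html">x</a>'
)

collector = FlowerLinkCollector()
collector.feed(sample)
links = collector.links
print(links)  # ['/flowers/example.html']
```
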

Otto Allmendinger