2

I need to find all links and images in the HTML source of a web page. Currently I have the following expression:

boost::regex findurl("(?s)<\\s*a\\s+.*?href\\s*=\\s*['\"]([^http]{1}[^\\s>]*)['\"]", boost::regex::normal | boost::regbase::icase);

What should it look like to also find images (the <img> tag)?

bgs
  • Careful, you might [summon Cthulhu](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) :) – djf May 30 '13 at 19:27

2 Answers

4

It will take you less time to learn Perl and use HTML::Parser than it will for you to debug this regex that won't work on pathological HTML. I can already spot three bugs in it for links, even though you're only asking about images.

This tutorial includes sample code that you can probably figure out how to modify even if you don't know Perl: http://perlmeme.org/tutorials/html_parser.html

djechlin
0

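Repeating a character inside a character class ([^http]) doesn't look correct: a negated class matches any single character that is not h, t, or p, so the second t is redundant and the class does not mean "not starting with http". djechlin has a point in that a regex is likely to be insufficient for all but the simplest HTML.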
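
To illustrate the character-class point, here is a minimal sketch of how the intent might be expressed with a negative lookahead instead, extended to cover img tags as the question asks. It assumes the goal of [^http] was to skip values that begin with "http", which the question does not state outright. It uses std::regex rather than boost::regex so that it is self-contained, and the function name find_links_and_images is made up for this example. It is still a regex, so it remains fragile on anything but simple HTML.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): collects the href values
// of <a> tags and the src values of <img> tags, skipping any value that
// begins with "http" (assumed to be the intent of the original [^http]).
std::vector<std::string> find_links_and_images(const std::string& html) {
    // (?!http) is a negative lookahead: it rejects values starting with
    // "http" without consuming any characters.
    // [^>]*? also matches newlines, so no (?s) modifier is needed.
    static const std::regex pattern(
        R"re(<\s*(?:a\s[^>]*?href|img\s[^>]*?src)\s*=\s*['"]((?!http)[^'"\s>]*)['"])re",
        std::regex::ECMAScript | std::regex::icase);

    std::vector<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), pattern), end;
         it != end; ++it) {
        urls.push_back((*it)[1].str());  // capture group 1 holds the URL
    }
    return urls;
}
```

The same pattern should also work with boost::regex, which supports Perl-style lookahead, but that has not been verified here. Known weak spots include unquoted attribute values, a ">" inside an attribute value, and attributes such as data-href.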

Happy Green Kid Naps