Boost regex, regular expression, url and img

Question

I need to find all links and images in the HTML source of a webpage. Currently I have the following expression:

boost::regex findurl("(?s)<\\s*a\\s+.*?href\\s*=\\s*['\"]([^http]{1}[^\\s>]*)['\"]", boost::regex::normal | boost::regbase::icase);

What should it look like to also find images (<img> tag)?

Careful, you might [summon Cthulhu](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) :) — djf, May 30 '13 at 19:27

score 4 · Answer 1 · answered May 22 '12 at 21:51

It will take you less time to learn Perl and use HTML::Parser than it will for you to debug this regex that won't work on pathological HTML. I can already spot three bugs in it for links, even though you're only asking about images.

This tutorial includes sample code that you can probably figure out how to modify even if you don't know Perl: http://perlmeme.org/tutorials/html_parser.html

score 0 · Answer 2 · answered May 22 '12 at 22:14

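Repeating a character inside a character class ([^http]) doesn't look right: the class matches a single character that is not h, t, or p; it does not exclude the string "http". djechlin has a point that a regex is likely to be insufficient for anything but the simplest HTML.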
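To make the character-class point concrete, here is a minimal sketch. It uses std::regex (ECMAScript syntax) as a stand-in so it compiles without Boost; I'd expect the same pattern to carry over to boost::regex with its default Perl syntax, but treat that as an assumption. The helper names find_relative_urls and char_class_accepts are made up for illustration, and the pattern is still a regex, so it only copes with simple, well-behaved markup.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the question): collects the href/src values
// of <a> and <img> tags, skipping values that start with "http:" or "https:".
std::vector<std::string> find_relative_urls(const std::string& html) {
    // (?!https?:) is a negative lookahead that rejects the whole prefix,
    // which is what [^http] was presumably meant to do.
    // [^>]*? keeps the search for the attribute inside a single tag.
    static const std::regex pattern(
        R"re(<\s*(?:a|img)\b[^>]*?\s(?:href|src)\s*=\s*["'](?!https?:)([^"'\s>]*)["'])re",
        std::regex::icase);

    std::vector<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), pattern), end;
         it != end; ++it) {
        urls.push_back((*it)[1].str());  // capture group 1 is the attribute value
    }
    return urls;
}

// Applies the question's character class to a whole attribute value.
// [^http] matches one character that is not 'h', 't' or 'p'.
bool char_class_accepts(const std::string& value) {
    static const std::regex cls(R"re([^http][^\s>]*)re");
    return std::regex_match(value, cls);
}
```

With this, char_class_accepts("page.html") is false because the value starts with 'p', even though it is a perfectly good relative link, while find_relative_urls keeps it. Note that the sketch does not tie href to <a> and src to <img>; it accepts either attribute on either tag.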

Happy Green Kid Naps
