How do I write a regex that finds all img tags and extracts each one's "src" value, while ignoring any img tag that has a given class? Say I would like to get the src of every img tag that doesn't have "dontGetMe" among its classes (it may still have other classes).

For example:

<img src="teste1.jpg" class="blueClass brightClass dontGetMe" />
<img src="teste2.jpg" class="blueClass" />
<img src="teste3.jpg" class="dontGetMe" />
<img src="teste4.jpg" />

In this example, my regex should get teste2.jpg and teste4.jpg.

The regex I have so far is the following (which gets all the img src values regardless of whether the "dontGetMe" class is present):

((?:\<img).*)(src)

Note: this regex will be used in a PHP script, so it has to run successfully on "http://www.phpliveregex.com".

EDIT: I totally agree that regex doesn't seem to be the clearest or most reliable way to do this, but my limited PHP knowledge ties me to it. The regex would be used in the following PHP function:

function Advanced_lazyload($buffer)
{
    (...)
    $pattern = '(REGEX EXPRESSION GOES HERE)';
    $buffer = preg_replace($pattern, "$1 src='temp.gif' ImageHolder", $buffer);
    return $buffer;
}
Andy Lester
Marcelo Myara
  • That current regex will not get you all the src values (as you say it does). – Smern Jun 26 '14 at 17:27
  • This is not a task for RegEx. Load that source into a DOM loader or BeautifulSoup (with Python). Using RegEx in your case will cause way too much trouble. –  Jun 26 '14 at 17:29
  • **Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php or [this SO thread](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester Jun 26 '14 at 18:48

1 Answer

Don't use regex for parsing HTML. This is a task for an XML parser.

The recommended way to do this is with XPath:

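// $html holds the markup to scan.
$doc = new DOMDocument();
$doc->loadHTML($html);
$dox = new DOMXPath($doc);
// Select the src attribute of every img whose class attribute does not contain "dontGetMe".
$elements = $dox->query('//img[not(contains(@class, "dontGetMe"))]/@src');
foreach ($elements as $el) {
    echo $el->nodeValue, "\n";
}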
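
One caveat with `contains(@class, "dontGetMe")` is that it is a substring test, so it would also skip an image whose class is, say, `dontGetMeToo`. A minimal sketch of a stricter variant follows, assuming the class name itself contains no quotes or spaces. The helper name `collectImageSources` and the `libxml_use_internal_errors` handling are illustrative additions, not part of the code above.

```php
<?php
// Hypothetical helper: returns the src of every <img> whose class list
// does not contain $excludedClass as a whole, space-separated token.
// Assumes $excludedClass contains no quotes or whitespace.
function collectImageSources(string $html, string $excludedClass): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings silently
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Pad the normalized class list with spaces so that only whole
    // class names match, not substrings of longer names.
    $query = sprintf(
        '//img[not(contains(concat(" ", normalize-space(@class), " "), " %s "))]/@src',
        $excludedClass
    );

    $sources = [];
    foreach ($xpath->query($query) as $attribute) {
        $sources[] = $attribute->nodeValue;
    }
    return $sources;
}

// Sample markup taken from the question.
$html = <<<'HTML'
<img src="teste1.jpg" class="blueClass brightClass dontGetMe" />
<img src="teste2.jpg" class="blueClass" />
<img src="teste3.jpg" class="dontGetMe" />
<img src="teste4.jpg" />
HTML;

foreach (collectImageSources($html, 'dontGetMe') as $src) {
    echo $src, "\n"; // prints teste2.jpg, then teste4.jpg
}
```

Images without a class attribute are still returned, because `normalize-space(@class)` evaluates to an empty string when the attribute is missing.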
Shiplu Mokaddim