Regex for matching all text inside opening and closing angle brackets of img tag

Question

I want to create a regex in PHP that matches the text between the opening angle bracket of an HTML img tag and its matching closing angle bracket. Let's say I have the HTML text in the variable $searchThis:

$searchThis = "<html><div></div><img src='/relative/path/img1.png'/></div>
<img src='/relative/path/img2.png'/><div></div></div>
<img src='/relative/path/img3.png'/><ul><li></li></ul></html>";

I want to match the content of each img tag, i.e. the part an ellipsis stands for in <img ... />. The result must be the following matches:

src='/relative/path/img1.png'
src='/relative/path/img2.png'
src='/relative/path/img3.png'

This is how I imagine the pattern should look, but it doesn't work for me:

$pattern = "<img([^\/]+)\/>";

You shouldn't try to parse HTML using regular expressions. Use XPath or a similar XML access approach instead. Have a look at [this collection](http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-xml/3577662#3577662). – Till Helge, Apr 16 '13 at 10:46
Do you want to get the output by regex only? What about simplehtmldom? – Jenson M John, Apr 16 '13 at 10:46
OK, but I won't use anything outside the PHP standard library. – 0xC0DEGURU, Apr 16 '13 at 10:51
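
For reference, a small sketch of why the pattern in the question probably fails. Two things seem to get in the way: the pattern has no explicit delimiters (PHP's PCRE functions accept bracket-style delimiters, so the outer < and > would be read as delimiters rather than as part of the pattern), and [^\/]+ cannot get past the slashes inside the path. The "~" delimiter and the "fixed" pattern below are assumptions chosen for illustration, not something taken from the thread.

```php
<?php
// Sample input from the question, shortened to a single tag.
$searchThis = "<div></div><img src='/relative/path/img1.png'/></div>";

// The question's pattern with explicit "~" delimiters added (arbitrary choice).
// [^\/]+ stops at the first "/" of the path, and that "/" is not followed by ">".
$original = '~<img([^\/]+)\/>~';
$countOriginal = preg_match_all($original, $searchThis, $m1);

// Assumed fix: exclude ">" instead of "/", and use a lazy quantifier
// so the capture ends right before the closing "/>".
$fixed = '~<img\s+([^>]+?)\s*/>~i';
$countFixed = preg_match_all($fixed, $searchThis, $m2);

echo $countOriginal, "\n"; // 0
echo $m2[1][0], "\n";      // src='/relative/path/img1.png'
```

This only holds for self-closing tags written as in the sample; a tag ending in a plain ">" would need a different tail in the pattern.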

Jean · Answer 1 · 2013-04-16T10:59:48.130

2

Try:

preg_match_all("`<img (.*)/>`Uis", $searchThis, $results);
print_r($results);

Printing the structure of $results will show you its content.

Note: If you want to be more precise, I suggest including src= in the pattern and matching up to the closing quote mark, so that only the image address is captured. You can then add the missing text (src=) back afterwards.
This way, you still get the relative path even when the image tag doesn't look as expected (i.e. when the tag contains other attributes such as alt="Smiley face" height="42" width="42").
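
A minimal sketch of that idea, assuming the attribute value is always quoted. The second tag in the input and the exact pattern are illustrative assumptions; the pattern captures the quote character and requires the same one to close the value.

```php
<?php
// Hypothetical input: one tag from the question plus one with extra attributes.
$searchThis = <<<HTML
<img src='/relative/path/img1.png'/>
<img alt="Smiley face" src="/relative/path/img2.png" height="42" width="42"/>
HTML;

// Group 1 is the opening quote, group 2 the path; \1 demands the same closing quote.
// [^>]*? keeps the search for "src" inside the current tag.
$pattern = '~<img\b[^>]*?\bsrc\s*=\s*(["\'])(.*?)\1~is';
preg_match_all($pattern, $searchThis, $results);

// Group 2 holds the bare paths; re-add the "src=" prefix afterwards.
foreach ($results[2] as $path) {
    echo "src='$path'\n";
}
```

With this input the loop prints one src='...' line per tag, regardless of where src sits among the other attributes.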

edited Apr 16 '13 at 10:59

answered Apr 16 '13 at 10:47

Jean

I don't know. I always use the ` character to delimit the pattern, so I can add modifiers such as U, i, s to tune the search options. – Jean Apr 16 '13 at 10:54
Any non-alphanumeric character can be used as a regex delimiter, although I must admit I've never seen the backtick used for this purpose before. – Tim Pietzcker Apr 16 '13 at 10:55
I don't remember where I first saw it, but since then I have only used this character. It looks clearer to me. Maybe I'm just used to it. – Jean Apr 16 '13 at 10:57
Let's say HTML is like this: `` (new line after ` – anubhava Apr 16 '13 at 11:02
No, because I put a space between ` – Jean Apr 16 '13 at 11:06
i - case insensitive, s - single line. What about U? – 0xC0DEGURU Apr 16 '13 at 11:12
@Jean: I completely understand how to make it work using regex but what I meant to say is regex approach has a huge risk while parsing HTML. – anubhava Apr 16 '13 at 11:14
@anubhava Of course you do. Sorry, I misread who had posted the comment. I agree with you, but 0xC0DEGURU says: regex only. – Jean Apr 16 '13 at 11:29
@0xC0DEGURU the `U` option means ungreedy. It is to force the regex to start looking from `` encountered. Try it without the `U` and you will see the difference. You may very well get the whole HTML code starting from the first `` encountered. – Jean Apr 16 '13 at 11:31
I got it. Like "?" but in "global scope". – 0xC0DEGURU Apr 16 '13 at 12:16
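
To make the comparison concrete, here is a small sketch of the three variants discussed in these comments. The one-line input is an assumption made up so that a greedy match has something to run across.

```php
<?php
// Two tags on one line, so a greedy match can span both of them.
$html = "<img src='a.png'/><p></p><img src='b.png'/>";

preg_match_all('`<img (.*)/>`is',  $html, $greedy);   // default: greedy
preg_match_all('`<img (.*)/>`Uis', $html, $ungreedy); // U flips every quantifier to lazy
preg_match_all('`<img (.*?)/>`is', $html, $lazy);     // ? makes only this quantifier lazy

print_r($greedy[1]);   // one match, running from the first tag to the last "/>"
print_r($ungreedy[1]); // two matches, one per tag
print_r($lazy[1]);     // the same two matches as with U
```

Note that with U set, a trailing ? switches a quantifier back to greedy, so mixing both in one pattern can be confusing.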

score 2 · Answer 2 · answered Apr 16 '13 at 10:50

Never try to parse HTML with regex. For parsing HTML, use a DOM parser. Consider code like this:

$html = <<< EOF
<html><div></div><img src='/relative/path/img1.png'/></div>
<img src='/relative/path/img2.png'/><div></div></div>
<img src='/relative/path/img3.png'/><ul><li></li></ul></html>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings (the sample has stray </div> tags) instead of printing them
$doc->loadHTML($html);            // parse the HTML string
$xpath = new DOMXPath($doc);
// select every <img> element anywhere in the document
foreach ($xpath->query("//img") as $node) {
    $src = $node->getAttribute('src'); // empty string if the attribute is missing
    echo "src='$src'\n";
}

OUTPUT:

src='/relative/path/img1.png'
src='/relative/path/img2.png'
src='/relative/path/img3.png'
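
If the goal is everything inside the tag rather than just src, the same DOM approach can be extended to walk all attributes. This is only a sketch: the input tag with an alt attribute is a made-up example, and rebuilding the text with single quotes is an assumption that breaks if a value itself contains a single quote.

```php
<?php
// Hypothetical tag with more than one attribute.
$html = "<html><body><img src='/relative/path/img1.png' alt='Smiley face'/></body></html>";

libxml_use_internal_errors(true); // keep libxml warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);

$lines = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    $pairs = [];
    // $img->attributes is a DOMNamedNodeMap of DOMAttr nodes.
    foreach ($img->attributes as $attr) {
        $pairs[] = $attr->nodeName . "='" . $attr->nodeValue . "'";
    }
    $lines[] = implode(' ', $pairs);
}
echo implode("\n", $lines), "\n";
```

The output is rebuilt from the parsed attributes, so it will not reproduce the original spacing or quote style of the source HTML.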

Thanks! I didn't know that the standard PHP library includes a DOM parser. I don't use PHP often and it's not my strong suit. – 0xC0DEGURU, Apr 16 '13 at 11:00
@0xC0DEGURU: I haven't coded in PHP for my work either. I only learnt it while answering questions on SO :P – anubhava, Apr 16 '13 at 11:04

score 0 · Answer 3 · answered Apr 16 '13 at 10:54

Example Parsing With simplehtmldom

    <?php
    include("simplehtmldom/simple_html_dom.php");
    // Create a DOM object from an HTML string
    $html = str_get_html("<html><div></div><img src='/relative/path/img1.png'/></div>
    <img src='/relative/path/img2.png'/><div></div></div>
    <img src='/relative/path/img3.png'/><ul><li></li></ul></html>");

    // Find all images
    foreach($html->find('img') as $element)
           echo $element->src . '<br>';
    ?>

Jenson M John

It seems like the most elegant solution, but I can only use the standard library :) Thanks – 0xC0DEGURU Apr 16 '13 at 11:02
