2

I want to create a regex in PHP that matches the text between the opening and closing angle brackets of an HTML img tag. Let's say I have the HTML text in the variable $searchThis:

$searchThis = "<html><div></div><img src='/relative/path/img1.png'/></div>
<img src='/relative/path/img2.png'/><div></div></div>
<img src='/relative/path/img3.png'/><ul><li></li></ul></html>";

I want to match the content inside each img tag, i.e. the part that an ellipsis would stand for. The result must be the following matches:

src='/relative/path/img1.png'
src='/relative/path/img2.png'
src='/relative/path/img3.png'

This is how I imagine the pattern should look, but it doesn't actually work for me:

$pattern = "<img([^\/]+)\/>";
0xC0DEGURU
  • You shouldn't try to parse HTML using regular expressions. Use XPath or some similar XML access approach instead. Have a look at [this collection](http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-xml/3577662#3577662). – Till Helge Apr 16 '13 at 10:46
  • Do you want to get the output only by regex? What about simplehtmldom? – Jenson M John Apr 16 '13 at 10:46
  • Ok, but I won't use anything out of the PHP standard library. – 0xC0DEGURU Apr 16 '13 at 10:51

3 Answers

2

Try:

preg_match_all("`<img (.*)/>`Uis", $searchThis, $results);
print_r($results);

Printing the structure of $results will show you its content.

Note: To be more precise, I would suggest including src= in your pattern and matching up to the closing quote mark, so that only the image address is selected. You can then add the missing text (src=) back afterwards.
This way, you still get the relative path even when the image tag doesn't look as expected (i.e. when it contains other attributes such as alt="Smiley face" height="42" width="42").
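
The suggestion in the note could be sketched roughly as follows. This is only an illustration under a few assumptions: the extra alt attribute on the first image is added here to show a tag with more than one attribute, the variable name $matches is made up for the example, and the pattern does not try to cover every way an attribute can be written (unquoted values, for instance).

```php
<?php
// Sample markup from the question; the alt attribute is an added assumption.
$searchThis = "<html><div></div><img alt='logo' src='/relative/path/img1.png'/></div>
<img src='/relative/path/img2.png'/><div></div></div>
<img src='/relative/path/img3.png'/><ul><li></li></ul></html>";

// Inside an <img ...> tag, find src= followed by a quoted value.
// Group 1 is the quote character, group 2 is the path between the quotes.
$pattern = '`<img\b[^>]*?\bsrc\s*=\s*(["\'])(.*?)\1`is';

preg_match_all($pattern, $searchThis, $results);

// Add the "src=" text back afterwards, as the note suggests.
$matches = array();
foreach ($results[2] as $path) {
    $matches[] = "src='" . $path . "'";
}
print_r($matches);
```

Capturing only the quoted value keeps the result independent of whatever other attributes appear before or after src.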

Jean
  • I don't know. I always use the ` character to delimit the pattern, so I can add modifiers like U, i, s to tune the search options. – Jean Apr 16 '13 at 10:54
  • Any non-alphanumeric character can be used as a regex delimiter, although I must admit I've never seen the backtick used for this purpose before. – Tim Pietzcker Apr 16 '13 at 10:55
  • I don't remember where I first saw it, but since then I have only used this character. It looks clearer to me. Maybe I'm just used to it. – Jean Apr 16 '13 at 10:57
  • Let's say HTML is like this: `` (new line after ` – anubhava Apr 16 '13 at 11:02
  • No, because I put a space between ` – Jean Apr 16 '13 at 11:06
  • i - case insensitive, s - single line. What about U? – 0xC0DEGURU Apr 16 '13 at 11:12
  • @Jean: I completely understand how to make it work using regex, but what I meant to say is that the regex approach carries a huge risk when parsing HTML. – anubhava Apr 16 '13 at 11:14
  • @anubhava Of course you do. Sorry, I misread who had posted the comment. I agree with you, but 0xc0deguru says: regex only. – Jean Apr 16 '13 at 11:29
  • @0xC0DEGURU the `U` option means ungreedy. It is to force the regex to start looking from `` encountered. Try it without the `U` and you will see the difference. You may very well get the whole HTML code starting from the first `` encountered. – Jean Apr 16 '13 at 11:31
  • I got it. Like "?" but in "global scope". – 0xC0DEGURU Apr 16 '13 at 12:16
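
The difference described in the comments can be seen with a small experiment. This is a minimal sketch, assuming a shortened two-image string ($html here is made up for the demonstration, not the markup from the question); the variable names are illustrative only.

```php
<?php
// Hypothetical two-image string, kept short so the captures are easy to read.
$html = "<img src='a.png'/><p></p><img src='b.png'/>";

// With U, quantifiers are ungreedy by default: each tag is matched on its own.
preg_match_all("`<img (.*)/>`Uis", $html, $ungreedy);

// Without U, .* is greedy and runs on to the last "/>" in the string.
preg_match_all("`<img (.*)/>`is", $html, $greedy);

// Writing .*? makes just this one quantifier ungreedy, without the U modifier.
preg_match_all("`<img (.*?)/>`is", $html, $lazy);

print_r($ungreedy[1]); // two captures, one per tag
print_r($greedy[1]);   // a single capture spanning both tags
print_r($lazy[1]);     // same captures as the U version
```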
2

Never try to parse HTML with regex. For parsing HTML, use a DOM parser. Consider code like this:

$html = <<< EOF
<html><div></div><img src='/relative/path/img1.png'/></div>
<img src='/relative/path/img2.png'/><div></div></div>
<img src='/relative/path/img3.png'/><ul><li></li></ul></html>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//img");
for($i=0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    $src = $node->attributes->getNamedItem('src')->nodeValue;
    echo "src='$src'\n";
}

OUTPUT:

src='/relative/path/img1.png'
src='/relative/path/img2.png'
src='/relative/path/img3.png'
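
If some img tags might lack a src attribute, a slightly more defensive variant is possible. The following is a sketch, not a replacement for the code above: the image without src is a hypothetical addition to the sample markup, and $sources is a name chosen for the example. It uses getElementsByTagName, hasAttribute and getAttribute from the same standard DOM extension.

```php
<?php
// Sample markup based on the question, plus one hypothetical <img> without src.
$html = "<html><div></div><img src='/relative/path/img1.png'/></div>
<img alt='no source'/>
<img src='/relative/path/img2.png'/></html>";

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$sources = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    // Skip images without src rather than reading a missing attribute node.
    if ($img->hasAttribute('src')) {
        $sources[] = $img->getAttribute('src');
    }
}
print_r($sources);
```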
anubhava
  • Thanks! I didn't know that the standard PHP library includes a DOM parser. I don't use PHP often and it's not my strong suit. – 0xC0DEGURU Apr 16 '13 at 11:00
  • @0xC0DEGURU: I haven't even coded in PHP for my work. I only learnt it while answering questions on SO :P – anubhava Apr 16 '13 at 11:04
0

Example Parsing With simplehtmldom

    <?php
    include("simplehtmldom/simple_html_dom.php");
    // Create DOM from a string
    $html = str_get_html("<html><div></div><img src='/relative/path/img1.png'/></div>
    <img src='/relative/path/img2.png'/><div></div></div>
    <img src='/relative/path/img3.png'/><ul><li></li></ul></html>");

    // Find all images
    foreach($html->find('img') as $element)
           echo $element->src . '<br>';
    ?>
Jenson M John