What is wrong with the regex pattern that I created?

$link_image_pattern = '/\<a\shref="([^"]*)"\>\<img\s.+\><\/a\>/';
preg_match_all($link_image_pattern, $str, $link_images);

What I'm trying to do is match all the links that have images inside them. But when I output $link_images, everything ends up inside the first index:

<pre>
  <?php print_r($link_images); ?>
</pre>

The markup looks something like this:

Array ( [0] => Array ([0] => "

<p>&nbsp;</p>

<p><strong><a href="url">Title</a></strong></p>

<p>Desc</p>

<p><a href="{$image_url2}"><img style="background-image:none;padding-left:0;padding-right:0;display:inline;padding-top:0;border-width:0;" title="image" border="0" alt="image" src="{$image_url2}" width="569" height="409"></a></p>

But when outputting the contents of the matches, it simply returns the first string that matches the pattern plus all the other markup in the page like this:

<a href="{$image_url}"><img style="background-image:none;padding-left:0;padding-right:0;display:inline;padding-top:0;border-width:0;" title="image" border="0" alt="image" src="{$image_url}" width="568" height="347"></a></p>

    <p>&nbsp;</p>

    <p><strong><a href="url">Title</a></strong></p>

    <p>Desc</p>

    <p><a href="{$image_url2}"><img style="background-image:none;padding-left:0;padding-right:0;display:inline;padding-top:0;border-width:0;" title="image" border="0" alt="image" src="{$image_url2}" width="569" height="409"></a></p>")
user225269
  • index0 will contain the whole string that matched the expression – DevZer0 Jun 30 '13 at 06:36
  • Use DomDocument library to read HTML and get its data. – Prix Jun 30 '13 at 06:37
  • possible duplicate of [Matching SRC attribute of IMG tag using preg\_match](http://stackoverflow.com/questions/2180255/matching-src-attribute-of-img-tag-using-preg-match) – Anirudha Jun 30 '13 at 06:39
  • Refer to above question and refer the answer which uses an html parser **NOT** regex – Anirudha Jun 30 '13 at 06:39
  • Regex is not a good way to parse HTML, see the following answer [Parse anchor tags which have img tag as child element](http://stackoverflow.com/questions/17357583/parse-anchor-tags-which-have-img-tag-as-child-element) – Prix Jun 30 '13 at 06:51
  • possible duplicate of [How to parse HTML with PHP?](http://stackoverflow.com/questions/3650125/how-to-parse-html-with-php) – raam86 Jun 30 '13 at 08:16

2 Answers

3

Foreword

Regex may not be the best solution for parsing HTML, but there are cases where it is the only option, such as when your text editor doesn't have an "insert HTML parsing script here" option in its search & replace form. If you are actually using PHP, then you'd be better off using a parsing script like:

$Document = new DOMXPath($doc); # $doc is a DOMDocument with the HTML already loaded
foreach ($Document->query('//a//img') as $img) {
    # do something with $img here
}
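
As a fuller sketch of the DOM approach, the snippet below loads an HTML fragment with DOMDocument and collects the href of every anchor that has an img as a direct child. The sample markup, the helper name find_image_links, and the query //a[img] (which selects the anchors themselves, rather than the images that //a//img returns) are assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns the href of every <a> that directly contains an <img>.
function find_image_links(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them for imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $hrefs = [];
    // //a[img] selects anchors that have at least one <img> child element.
    foreach ($xpath->query('//a[img]') as $anchor) {
        $hrefs[] = $anchor->getAttribute('href');
    }
    return $hrefs;
}

// Assumed sample input: one text link and two image links.
$html = '<p><a href="url">Title</a></p>'
      . '<p><a href="first.png"><img src="first.png" alt="image"></a></p>'
      . '<p><a href="second.png"><img src="second.png" alt="image"></a></p>';

print_r(find_image_links($html));
```

If the image may be nested deeper inside the anchor (for example wrapped in a span), //a[.//img] would be the looser variant of the same query.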

Description

This format generally keeps the you-can't-do-that-in-regex haters away. It ensures your anchor tag contains an img tag, while also avoiding the odd (and very improbable) edge case where an attribute value contains something that looks like an image tag.

<a\b(?=\s|>)     # match the open anchor tag
(?:='[^']*'|="[^"]*"|=[^'"][^\s>]*|[^>=])*    # match the contents of the tag, skipping over the quoted values
>    # match the close of the anchor tag
<img\b(?=\s|>)    # match the open img tag
(?:='[^']*'|="[^"]*"|=[^'"][^\s>]*|[^>=])*     # match the contents of the img tag, skipping over the quoted value
>   # match the close of the img tag
<\/a>   # match the close anchor tag

PHP Code Example:

Sample Text

Note that the last line has an ugly attribute which will foil most other regular expressions.

<p>&nbsp;</p>
<p><strong><a href="url">Title</a></strong></p>
<p>Desc</p>
<p><a href="{$image_url2}"><img style="background-image:none;padding-left:0;padding-right:0;display:inline;padding-top:0;border-width:0;" title="image" border="0" alt="image" src="{$image_url2}" width="569" height="409"></a></p>

<p><a href="{$image_url2}" Onmouseover="function(' ><img src=picture.png></a> ');" >I do not have an image</a></p>

Code

<?php
$sourcestring="your source string";
preg_match_all('/<a\b(?=\s|>)
(?:=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*|[^>=])*
>
<img\b(?=\s|>)
(?:=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*|[^>=])*
>
<\/a>/imsx',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>

Matches

[0] => <a href="{$image_url2}"><img style="background-image:none;padding-left:0;padding-right:0;display:inline;padding-top:0;border-width:0;" title="image" border="0" alt="image" src="{$image_url2}" width="569" height="409"></a>
Ro Yo Mi
-1

Maybe the problem is in the .+\> part: .+ is greedy, so it matches everything up to the last >.

Try the same approach you already use to stop at the ": [^\>]+. This works in my editor:

<a.+><img[^>]+></a>

For your needs, you only have to add backslashes \ before <, > and /.
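
The difference between the greedy .+ and the negated class [^>]+ can be seen in a small sketch. The sample string and variable names below are assumptions; the first pattern is the one from the question with its redundant backslashes dropped, and the second only swaps .+ for [^>]+.

```php
<?php
// Assumed sample: two image links on a single line, so newlines play no role.
$str = '<a href="one.png"><img src="one.png"></a><p>Desc</p>'
     . '<a href="two.png"><img src="two.png"></a>';

// Greedy: .+ runs to the end of the line and backtracks only to the LAST "></a>".
$greedy = '/<a\shref="([^"]*)"><img\s.+><\/a>/';
// Negated class: [^>]+ cannot cross the ">" that closes the img tag.
$bounded = '/<a\shref="([^"]*)"><img\s[^>]+><\/a>/';

preg_match_all($greedy, $str, $greedy_matches);
preg_match_all($bounded, $str, $bounded_matches);

echo count($greedy_matches[0]), "\n";  // number of matches with the greedy pattern
echo count($bounded_matches[0]), "\n"; // number of matches with the negated class
print_r($bounded_matches[1]);          // index 1 holds the captured href values
```

Index 0 of the result always holds the full matched text, while index 1 holds the first capture group. Note that [^>]+ still breaks if an attribute value itself contains a >, which is the edge case the longer pattern in the other answer is written to handle.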

vladkras
  • Regex is not the way to parse HTML have you notice how many edits you've done on the past few minutes/seconds, not to mention this question is a duplicate. – Prix Jun 30 '13 at 06:52
  • @Prix 1. be honest, my last edit 21 min ago, your link - 17 min ago, so you did it 4 min later 2. try to read the question before minusing, he tries to "match", not "parse" 3. I can make as many edits as want in 5 minutes, and you'd better notice smth else – vladkras Jun 30 '13 at 07:13
  • still the regex is not the way, he could use strpos, he could still DomDocument and no I wasn't referring to your edit 21 minutes ago I was referring myself to all of your edits during the period prior my comment which was more than 4 which proves regex are not easy to deal with for parsing HTML where you could have done that way easier using DomDocument and matching the extracted string or even using strpos or similar option IF he is comparing a link. – Prix Jun 30 '13 at 07:17
  • I'm glad you spent so much time following my answer, but the only reason to edit it again and again was because I was not sure he could use it correctly, e.g. if it's clear, that he needs to escape some chars (my text editor doesn't need it) – vladkras Jun 30 '13 at 07:27
  • @Prix Are you downvoting because it's wrong or because it's "not the right way"? – raam86 Jun 30 '13 at 07:40
  • @raam86 I am downvoting it because regex is not a good option to parse HTML the OP itself is having trouble with it and this poster also had as well. If you feel that regex if the way to parse HTML do by all means upvote. Keep in mind that there is also duplicate for his question with a good approach that he can use that is on the OP comments. – Prix Jun 30 '13 at 07:51
  • @Prix why are you so stubborn? it's an answer for those who look for correct regex, not "the best way to parse HTML using XPath (what?!?!)" – vladkras Jun 30 '13 at 08:01
  • @Prix I'm not a huge fan either but vladkras has point...Just vote to close. – raam86 Jun 30 '13 at 08:15