I have sets of HTML anchor elements enclosing image elements. For each set, using PHP-CLI, I want to pull the URLs and classify them by type. The type of an anchor can only be determined from an attribute of its child image element.
It would be easy if there were only one anchor of each type per set. My problem arises when two anchor elements of one type are separated by one or more anchors of other types: my non-greedy parenthesized sub-pattern seems to become greedy and expands until it finds the second relevant child attribute. In my test script I'm trying to pull the 'Userlink' URLs from amongst the other types, using a simple pattern like:
#<a href="(.*?)" custattr="value1"><img alt="Userlink"#
On a set like:
<li><a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink" class="common_link_class" height="123" src="pic0.png" width="123" style="width: 123px;"></a></li><li><a href="http://www.socnet1.com/username1" custattr="value1"><img alt="Socnet1" class="common_link_class" height="123" src="pic1.png" width="123" style="width: 123px;"></a></li><li><a href="http://www.socnet2.com/username1" custattr="value1"><img alt="Socnet2" class="common_link_class" height="123" src="pic2.png" width="123" style="width: 123px;"></a></li><li><a href="mailto:useralias1@unlikely.zyx321.usermail.net" custattr="value1"><img alt="Usermail" class="common_link_class" height="123" src="pic3.png" width="123" style="width: 123px;"></a></li><li><a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink" class="common_link_class" height="123" src="pic4.png" width="123" style="width: 123px;"></a></li>
(Sorry, but the actual HTML is all on one line like that.)
My sub-pattern captures from the beginning of the first "Userlink" URL to the end of the last one.
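To make the behaviour easier to reproduce, here is a minimal sketch on a shortened input. The input is a hypothetical cut-down version of the set above (three anchors, most image attributes dropped), not the real data. On this input the first capture is the expected URL, while the second one starts at the next anchor's URL and runs through to the second "Userlink" URL.

```php
<?php
// Shortened, hypothetical input: two Userlink anchors with one Socnet1 anchor between them.
$input = '<li><a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink" src="pic0.png"></a></li>'
       . '<li><a href="http://www.socnet1.com/username1" custattr="value1"><img alt="Socnet1" src="pic1.png"></a></li>'
       . '<li><a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink" src="pic4.png"></a></li>';

// Same pattern as in the question.
$pcre = '#<a href="(.*?)" custattr="value1"><img alt="Userlink"#';
$matches = array();
preg_match_all($pcre, $input, $matches);

// Print each capture with its length so an over-long one stands out.
foreach ($matches[1] as $i => $capture) {
    echo $i, ' (', strlen($capture), ' chars): ', $capture, "\n";
}
```

The lazy quantifier only keeps the capture as short as possible from a given starting point. It does not stop the engine from starting at an earlier `<a href="` and letting `.` run across quotes and tags until the rest of the pattern fits.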
I've tried many variations of look-aheads; I'm not sure I should list them all here. So far they've either returned no match at all or the same result as described above.
Here's my test script (running in a Bash shell):
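#!/usr/bin/php
<?php
// Read all of STDIN into one string, counting lines as we go.
$lines = 0;
$input = "";
$matches = array();
while (($line = fgets(STDIN)) !== false) {
    $input .= $line;
    $lines++;
}
fwrite(STDERR, "Processing $lines lines\n");
// Capture the href value of anchors whose child image has alt="Userlink".
$pcre = '#<a href="(.*?)" custattr="value1"><img alt="Userlink"#';
if (preg_match_all($pcre, $input, $matches)) {
    // $matches[1] holds the captures; count($matches) only counts the groups.
    fwrite(STDERR, "\$matches[1] has " . count($matches[1]) . " elements\n");
    foreach ($matches[1] as $match) {
        fwrite(STDOUT, $match . "\n");
    }
}
?>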
What PCRE pattern for PHP's preg_match_all() would return the two "Userlink" URLs in the above example?
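One candidate direction, sketched here rather than offered as a definitive answer, is to replace `.*?` with the negated character class `[^"]*` so the capture cannot cross the closing quote of the `href` value. This assumes that `href` values never contain a literal double quote and that the attribute order and spacing are exactly as in the sample. The input below is again a hypothetical shortened version of the set.

```php
<?php
// Shortened, hypothetical input with Userlink anchors separated by other types.
$input = '<li><a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink" src="pic0.png"></a></li>'
       . '<li><a href="http://www.socnet1.com/username1" custattr="value1"><img alt="Socnet1" src="pic1.png"></a></li>'
       . '<li><a href="mailto:useralias1@unlikely.zyx321.usermail.net" custattr="value1"><img alt="Usermail" src="pic3.png"></a></li>'
       . '<li><a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink" src="pic4.png"></a></li>';

// [^"]* matches any run of characters except a double quote,
// so the capture has to end at the href value's closing quote.
$pcre = '#<a href="([^"]*)" custattr="value1"><img alt="Userlink"#';
$matches = array();
preg_match_all($pcre, $input, $matches);

$urls = $matches[1];
foreach ($urls as $url) {
    echo $url, "\n";
}
```

With the capture confined to a single attribute value, an attempt that starts at a non-Userlink anchor fails outright instead of stretching forward to the next Userlink image.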