I'm parsing image links from external webpages in my PHP script. This is my pattern:

$pattern = '/<img[^<>]+?src=["\']([^<>]+?)["\']/';

I found tags like this in some pages:

<img class="avatar-32" src="<%= avatar %>" />

That's why I use [^<>]. I don't know how to catch other potentially malformed tags.

So I want to know how to refine my pattern so that it accepts only valid img tags.

There are questions like:

  1. Can there be spaces between src, = and "?
  2. Between < and img?
  3. Even newlines?
  4. What if I find a ' in the src attribute?

In fact, how do browsers parse links?

Note: I didn't restrict the pattern to file extensions because the links can be:

http://www.example.com/img.jpg?1234
http://www.example.com/img.php
http://www.example.com/img/

I also have a relative-to-absolute link converter, so the conversion is not the problem.

Andy Lester
jscripter
  • Don't parse HTML with regexes. As you're finding out, it's impossible to do consistently/accurately/reliably. Use DOM instead. ALL of your questions go away once you start using DOM operations. – Marc B Feb 20 '14 at 22:53
  • [Obligatory post to this SO answer](http://stackoverflow.com/a/1732454/383609). On a more helpful note, use the [PHP DOM](http://uk3.php.net/dom) library – Bojangles Feb 20 '14 at 22:55
  • This is a solved problem. People have already written, tested and debugged code that handles this already. Whenever you have a programming problem that others have probably had to deal with in the past, then look for existing code that does it for you. – Andy Lester Feb 20 '14 at 23:05
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – jillro Feb 20 '14 at 23:11

1 Answer

You'd be better off using DOMDocument. It has many useful methods for finding links, reading text content, manipulating the DOM, and more.

For example, to get the URLs of images:

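$dom = new DOMDocument;
// Real-world HTML is often malformed; collect libxml warnings instead of printing them
libxml_use_internal_errors(true);
$dom->loadHTML($response); // $response is assumed to hold the HTML of the page you fetched
libxml_clear_errors();

foreach ($dom->getElementsByTagName('img') as $node) {
    if ($node->hasAttribute('src')) {
        $url = $node->getAttribute('src');
        // You can also validate the URL here (with a regex, for example)
        // and skip values like "<%= avatar %>"
        echo htmlspecialchars($url), '<br>'; // escape before printing into HTML
    }
}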
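The comments in the loop suggest validating each URL and skipping values like "<%= avatar %>". A minimal sketch of that idea follows. The helper name `extractImageSrcs`, the sample markup, and the "reject anything containing < or >" rule are all assumptions made for illustration, and the rule is a heuristic rather than a full URL validator. The sample also covers the cases asked about in the question: spaces around `=`, a newline inside the tag, a `'` inside a double-quoted value, and a space between `<` and `img` (which the parser does not treat as a tag).

```php
<?php
// Hypothetical helper: returns the src values of all <img> elements in $html,
// skipping missing or empty values and obvious template placeholders.
function extractImageSrcs(string $html): array
{
    // Collect libxml warnings about malformed markup instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $srcs = [];
    foreach ($dom->getElementsByTagName('img') as $img) {
        // getAttribute() returns '' when the attribute is missing
        $src = trim($img->getAttribute('src'));
        // Heuristic (an assumption): a usable URL has no raw "<" or ">",
        // while placeholders such as "<%= avatar %>" do.
        if ($src === '' || preg_match('/[<>]/', $src)) {
            continue;
        }
        $srcs[] = $src;
    }
    return $srcs;
}

// Made-up sample markup covering the cases from the question
$sampleHtml = <<<'HTML'
<html><body>
<img class="avatar-32" src="<%= avatar %>" />
<img src = "http://www.example.com/img.jpg?1234">
<img
   src='http://www.example.com/img.php'>
<img src="http://www.example.com/it's/">
< img src="http://www.example.com/not-a-tag.png">
</body></html>
HTML;

$srcs = extractImageSrcs($sampleHtml);
foreach ($srcs as $src) {
    echo $src, "\n";
}
```

Because the HTML parser does the tokenizing, the whitespace and quoting variations need no special handling in your own code; only the decision about which values count as usable URLs is left to you.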

These properties and methods can also be very useful:

$node->nodeValue  // The text content of the node
$node->childNodes // The children of the node, returned as a DOMNodeList,
                  // the same type getElementsByTagName('img') returns
$node->nodeType   // Some of the nodes in childNodes are text nodes,
                  // so you can skip them with a conditional:
                  // if ($node->nodeType === XML_ELEMENT_NODE) { /* it's an element node */ }

$nodes->length    // Number of nodes in a DOMNodeList
$nodes->item(1)   // The 2nd node of a DOMNodeList (the index is zero-based)
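To show how these pieces fit together, here is a small sketch that walks the children of a node and keeps only the element nodes. The helper name `elementChildren` and the sample markup are made up for this example.

```php
<?php
// Hypothetical helper: returns only the element children of $node,
// skipping text nodes, comments, and other non-element nodes.
function elementChildren(DOMNode $node): array
{
    $elements = [];
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) { // XML_ELEMENT_NODE is 1
            $elements[] = $child;
        }
    }
    return $elements;
}

// Made-up sample markup
$html = '<html><body><div> <p>Hello <b>world</b></p> <img src="a.png"> tail</div></body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$divs = $dom->getElementsByTagName('div'); // a DOMNodeList
$div = $divs->item(0);                      // first match; item() is zero-based

echo 'div elements found: ', $divs->length, "\n";
foreach (elementChildren($div) as $child) {
    // nodeValue is the concatenated text of the node and its descendants
    echo $child->nodeName, ': ', trim($child->nodeValue), "\n";
}
```

Here the text around the `<p>` and `<img>` elements is skipped, so only the two elements are reported.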