I'm parsing image links from external webpages in my PHP script. This is my pattern:
$pattern = '/<img[^<>]+?src=["\']([^<>]+?)["\']/';
I found tags like this in some pages:
<img class="avatar-32" src="<%= avatar %>" />
That's why I use [^<>] in the pattern.
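For context, here is a minimal sketch of how the pattern can be applied with preg_match_all. The sample $html string and the variable names $html, $matches and $links are made up for illustration; only the pattern itself is the real one.

```php
<?php
// Made-up sample markup: one ordinary tag and one unrendered template tag.
$html = '<img class="logo" src="http://www.example.com/img.jpg?1234" />'
      . '<img class="avatar-32" src="<%= avatar %>" />';

$pattern = '/<img[^<>]+?src=["\']([^<>]+?)["\']/';

// Capture group 1 holds the src value of every tag the pattern matches.
preg_match_all($pattern, $html, $matches);
$links = $matches[1];

print_r($links);
```

Because [^<>] refuses to match < or >, the template tag should be skipped and only the ordinary link collected.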
And I don't know how to catch other potentially malformed tags.
So I'd like to know how to improve my pattern so that it accepts only valid img tags.
There are questions like:
- Can there be spaces between src and = and "?
- Between < and img?
- Even newlines?
- What if I find a ' in the src attribute?
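To make these cases concrete, here is a small sketch that runs the current pattern against one sample tag per question. The sample tags and the names $cases and $results are hypothetical, invented only to exercise the pattern.

```php
<?php
$pattern = '/<img[^<>]+?src=["\']([^<>]+?)["\']/';

// Hypothetical sample tags, one per question in the list above.
$cases = [
    'spaces around ='    => '<img src = "a.jpg">',
    'newline before src' => "<img\nsrc=\"b.jpg\">",
    'space after <'      => '< img src="c.jpg">',
    'apostrophe in src'  => '<img src="it\'s.jpg">',
];

$results = [];
foreach ($cases as $label => $tag) {
    // preg_match returns 1 on a match; group 1 is the captured src value.
    $results[$label] = preg_match($pattern, $tag, $m) === 1 ? $m[1] : null;
}

var_dump($results);
```

Any case that comes back as null, or with a shortened value, is one the pattern does not handle yet.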
In fact, how do browsers parse these links?
Note: I didn't restrict the pattern to file extensions because the links can be:
http://www.example.com/img.jpg?1234
http://www.example.com/img.php
http://www.example.com/img/
Also, I already have a relative-to-absolute link converter, so the conversion is not the problem.