First of all, you shouldn't parse HTML with Regular Expressions.
Solution 1
Now, if you are exclusively parsing img tags, you could come up with a good-enough solution like this:
(\b\.jpg|\b\.png)\?(.+?)\"
That is:
(\b\.jpg|\b\.png)  # 1st Capturing Group
    \b\.jpg        # 1st Alternative: word boundary, then ``.jpg`` literally
    \b\.png        # 2nd Alternative: word boundary, then ``.png`` literally
\?                 # Match the character ? literally
(.+?)              # 2nd Capturing Group
    .+?            # Match any character between one and unlimited times,
                   # as few times as possible, expanding as needed.
\"                 # Match the character " literally
Problem
What's the problem? We are not checking whether we are inside an img tag. This will match anywhere in the HTML.
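To make this concrete, here is a small sketch that runs the Solution 1 pattern against a link instead of an image. The sample markup is made up for illustration, and the `/` delimiters are only added because PHP's PCRE functions require them.

```php
<?php
// Solution 1 pattern, wrapped in '/' delimiters for PCRE.
$pattern = '/(\b\.jpg|\b\.png)\?(.+?)\"/';

// Hypothetical input: the URL sits in an <a> tag, there is no img tag at all.
$html = '<a href="photo.jpg?v=1">download</a>';

$matched = preg_match($pattern, $html, $m);

// The pattern still matches, and group 2 holds the query string of the link.
var_dump($matched); // int(1)
var_dump($m[2]);    // string(3) "v=1"
```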
Solution 2
Let's add the check for img > src:
<img.+?src=\".*?(\b\.jpg|\b\.png)\?(.+?)\"
That is:
<img # Match ``<img`` literally
.+? # Match any character between one and unlimited times,
# as few times as possible, expanding as needed.
# Needed in case other attributes, such as rel or alt, come before src.
src=\" # Match ``src="`` literally
... # The rest is same as before.
Problem
Does this really do its job? It appears to, but in reality it does not.
Consider the following HTML code:
<img src="data:image/png;base64,iVBORw0KG" />
<div style="background-image: url(../images/test-background.jpg?)">
blah blah
</div>
It shouldn't match, right? But it does (if you remove the line breaks, because . does not match a newline by default). The regular expression above starts the match at <img src=" and stops at the " just before the > of the div tag. The capturing group will contain the characters between ? and ", that is ), and substituting it will break the HTML.
This was just an example, but many other situations will match even if they should not.
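You can check the false positive yourself with a quick sketch like the one below. It uses the HTML from the example with the line breaks removed. The variable names are arbitrary.

```php
<?php
// Solution 2 pattern, wrapped in '/' delimiters for PCRE.
$pattern = '/<img.+?src=\".*?(\b\.jpg|\b\.png)\?(.+?)\"/';

// The example HTML from above, joined onto one line.
$html = '<img src="data:image/png;base64,iVBORw0KG" />'
      . '<div style="background-image: url(../images/test-background.jpg?)">'
      . 'blah blah</div>';

preg_match($pattern, $html, $m);

// The match begins in the img tag but ends inside the div's style attribute.
var_dump($m[0]);
// Group 2 is the ")" that closes url(...), not a query string.
var_dump($m[2]); // string(1) ")"
```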
Other solutions...?
No matter how many constraints you add to your RegEx and how sophisticated it becomes... HTML is a Context-Free Language, so it can't be fully captured by a Regular Expression, which only recognizes Regular Languages.
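If you are free to use a parser instead, a minimal sketch with PHP's built-in DOMDocument could look like this. The helper name replaceImgQuery and the sample markup are assumptions made up for illustration, and the rule "only touch .jpg and .png sources that already have a query string" is my reading of what the patterns above try to do.

```php
<?php
// Hypothetical helper: replace the query string of .jpg/.png img sources.
function replaceImgQuery(string $html, string $newQuery): string
{
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    foreach ($doc->getElementsByTagName('img') as $img) {
        // Split "path?query" into at most two parts.
        $parts = explode('?', $img->getAttribute('src'), 2);
        if (count($parts) !== 2) {
            continue; // no query string, e.g. a data: URI
        }
        $ext = strtolower(pathinfo($parts[0], PATHINFO_EXTENSION));
        if (in_array($ext, ['jpg', 'png'], true)) {
            $img->setAttribute('src', $parts[0] . '?' . $newQuery);
        }
    }

    // Note: saveHTML() returns a full document (doctype, html and body included).
    return $doc->saveHTML();
}

$out = replaceImgQuery(
    '<div><img src="folder/img1.jpg?foo">'
    . '<img src="data:image/png;base64,iVBORw0KG"></div>'
    . '<div style="background-image: url(../images/test-background.jpg?)">blah blah</div>',
    'bar'
);
echo $out;
```

Only the src attribute of real img elements is touched here, so the background-image URL in the style attribute is left alone.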
In PHP
Still sure you're going to use Regular Expressions? Alright, then your PHP function is preg_replace. You only need to keep in mind that it replaces everything that matched, not only the capturing groups. Hence, you need to wrap what you want to "remember" into additional capturing groups:
$str = '<img src="folder/img1.jpg?foo">';
$pattern = '/(<img.+?src=\".*?(\b\.jpg|\b\.png)\?)(.+?)(\")/';
$replacement = '$1' . 'bar' . '$4';
$str_replaced = preg_replace($pattern, $replacement, $str);
// Now you have $str_replaced = '<img src="folder/img1.jpg?bar">';
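One caveat worth a sketch: if the new value starts with a digit, '$1' followed by that digit is read as a two-digit group reference. The `${1}` form that preg_replace also accepts avoids the ambiguity. The value '2' below is just a made-up example.

```php
<?php
$str = '<img src="folder/img1.jpg?foo">';
$pattern = '/(<img.+?src=\".*?(\b\.jpg|\b\.png)\?)(.+?)(\")/';

// '$1' . '2' would become '$12', i.e. a reference to group 12.
// '${1}' keeps the group number separate from the digit that follows.
$replacement = '${1}' . '2' . '${4}';

$str_replaced = preg_replace($pattern, $replacement, $str);
// Now you have $str_replaced = '<img src="folder/img1.jpg?2">';
echo $str_replaced;
```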