1

I've got a string containing HTML code, and I want to change <img src="anything.jpg"> to <img src="'.DOC_ROOT .'anything.jpg"> every time it occurs in the string. I really don't want to use an HTML parser, since this will be the only thing I'll be using it for. Does anyone know how to do this in PHP, using a regex for example?

Jonan
  • You seem to know where to look. Have you tried using a regex? – Mark Jan 29 '14 at 17:35
  • well, I tried, but I'm not really good with regex and nothing I tried worked – Jonan Jan 29 '14 at 17:37
  • 1
    https://coderwall.com/p/on3ffa 20secs searching... – soyuka Jan 29 '14 at 17:37
  • that's with a parser, right? – Jonan Jan 29 '14 at 17:39
  • 1
    @Jonan Yes, but [the DOM extension](http://php.net/dom) should be built into your PHP install without any extra code, and it will be less likely to cause subtle problems than a string-based solution. – IMSoP Jan 29 '14 at 17:43
  • @Jonan I've updated my answer to handle other variables such as varying attributes and single quotes, self-closing tags, and a bunch of other things. Perhaps you want to check it out. – Joeytje50 Jan 29 '14 at 18:03
  • chose your answer as the best, although I didn't know the DOM extension is built into PHP, so I'll be using that – Jonan Jan 29 '14 at 18:05
  • By the way, instead of using regex and this whole business, you could also just use the HTML tag. – Joeytje50 Jan 29 '14 at 18:10

4 Answers

4

You really should use a parser, but since you made it clear that you really don't want to, you can use the following regex replace:

$string = preg_replace('/<img([^>]*)src=["\']([^"\'\\/][^"\']*)["\']/', '<img\1src="'.DOC_ROOT.'\2"', $string);

Demo. This regular expression will not modify any URLs that already begin with a slash (root-relative paths). Change it to the following if you do want to match those:

$string = preg_replace('/<img([^>]*)src=["\']["\'\\/]?([^"\']*)["\']/', '<img\1src="'.DOC_ROOT.'\2"', $string);

Demo.
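
To make the difference between the two patterns concrete, here is a minimal sketch that runs both on small inputs. The value of `DOC_ROOT` and the function names `prefix_skip_slash` and `prefix_all` are assumptions for illustration only; the question does not say what `DOC_ROOT` holds.

```php
<?php
// Assumed value for illustration; the question does not say what DOC_ROOT holds.
defined('DOC_ROOT') || define('DOC_ROOT', '/var/www/');

// First pattern: leaves src values that start with a slash untouched.
function prefix_skip_slash(string $html): string
{
    return preg_replace(
        '/<img([^>]*)src=["\']([^"\'\\/][^"\']*)["\']/',
        '<img\1src="' . DOC_ROOT . '\2"',
        $html
    );
}

// Second pattern: also rewrites src values that start with a slash
// (the leading slash is consumed and not copied into the result).
function prefix_all(string $html): string
{
    return preg_replace(
        '/<img([^>]*)src=["\']["\'\\/]?([^"\']*)["\']/',
        '<img\1src="' . DOC_ROOT . '\2"',
        $html
    );
}

echo prefix_skip_slash('<img src="anything.jpg">'), "\n";  // <img src="/var/www/anything.jpg">
echo prefix_skip_slash('<img src="/img/logo.png">'), "\n"; // <img src="/img/logo.png"> (unchanged)
echo prefix_all('<img src="/img/logo.png">'), "\n";        // <img src="/var/www/img/logo.png">
```

Note that both patterns always write the attribute back with double quotes, whatever quoting the input used.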

Joeytje50
  • 2
    This is a good example of how fragile regex solutions can be. A few cases where this will fail: 1) if there is more than one space between `img` and `src`; 2) if the HTML contains XML-style self-closed tags (``); 3) if the HTML uses single-, not double-quotes around attributes; 4) if the img tags contain other attributes, e.g. `class` or `id`; 5) if the HTML contains URLs which don't need prefixing, e.g. URLs already pointing to a different domain. – IMSoP Jan 29 '14 at 17:46
  • @IMSoP I realise that, which is why I reminded the OP that he really shouldn't do this. I'll improve my regex a bit though. – Joeytje50 Jan 29 '14 at 17:51
  • Yes, a regex solution can still be a lot better than this one. Problems 1, 2, and 3 are pretty trivial to fix; 4 is a little trickier given you don't want to assume `src` is the *first* attribute. 5 is the hardest, but you could use a negative assertion to ignore attributes beginning `http`, or use `preg_replace_callback` to have the substitution run through a callback function. – IMSoP Jan 29 '14 at 17:55
  • @IMSoP I think I've covered those 5 problems quite well. Can you think of any way it would still break? – Joeytje50 Jan 29 '14 at 18:02
  • Nice. I'm sure there are still edge-cases - HTML is an incredibly forgiving language, so there's lots of ways to write the same thing - but as regexes go, that's probably not a bad bet. – IMSoP Jan 29 '14 at 18:05
  • @IMSoP yeah, like that 'most referred to answer on SO' said, it's always better to use parsers, but indeed for some situations, regex can be a relatively functional alternative. – Joeytje50 Jan 29 '14 at 18:08
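
The `preg_replace_callback` idea raised in the comments above could look roughly like the following. This is a sketch, not the answer's actual code: the function name `prefix_img_src`, the value of `DOC_ROOT`, and the rule for which URLs count as "don't need prefixing" (anything with a scheme, or starting with a slash) are all assumptions.

```php
<?php
// Assumed value for illustration only.
defined('DOC_ROOT') || define('DOC_ROOT', '/var/www/');

// Hypothetical helper: prefixes relative img src values and leaves the rest alone.
function prefix_img_src(string $html): string
{
    return preg_replace_callback(
        // 1: everything up to and including "src =", 2: the quote character, 3: the URL
        '/(<img\b[^>]*?\ssrc\s*=\s*)(["\'])(.*?)\2/i',
        function (array $m): string {
            $url = $m[3];
            // Skip URLs with a scheme (http:, data:, ...) and paths starting with a slash.
            if (preg_match('~^(?:[a-z][a-z0-9+.-]*:|/)~i', $url)) {
                return $m[0];
            }
            // Rebuild the attribute with the original quote style.
            return $m[1] . $m[2] . DOC_ROOT . $url . $m[2];
        },
        $html
    );
}

echo prefix_img_src('<img class="x" src=\'anything.jpg\' />'), "\n";
echo prefix_img_src('<img src="http://example.com/a.png">'), "\n";
```

The callback keeps the decision logic in ordinary PHP instead of packing it into the pattern, but it is still string matching and shares the general fragility discussed above.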
4

If you absolutely have to use regular expressions instead of a DOM parser, you could use this.

Not sure where DOC_ROOT is coming from though, since it's not a valid PHP variable (maybe a constant?). Also be aware that you won't be able to use an embedded variable inside the string if you have single quotes.

You probably want something more like:

img.*?src=['"](.*?)['"]

Replacing with:

img src="$_SERVER['DOCUMENT_ROOT']$1"

Which converts:

echo "<img src='anything.jpg'>"; //into:
echo "<img src='{$_SERVER['DOCUMENT_ROOT']}/anything.jpg'>";

http://regex101.com/r/vN7lN9

In php, the code would look like this:

$string = "<img src='anything.jpg'>";
echo preg_replace('/img.*?src=[\'\"](.*?)[\'\"]/', "img src='$_SERVER[DOCUMENT_ROOT]/$1'", $string);

Be warned that if your DOM contains irregular HTML (a tag misplaced here and there, spaces around the = sign), you're liable to end up causing a lot of problems. That's where a DOM parser comes in handy.

brandonscript
  • 1
    A bareword like `DOC_ROOT` denotes a [constant](http://php.net/manual/en/language.constants.php). Your example, however, contains an invalid constant `DOCUMENT_ROOT` - you should be quoting the key, as `$_SERVER['DOCUMENT_ROOT']`. – IMSoP Jan 29 '14 at 17:48
  • @IMSoP I was writing it in the string - just hadn't included that bit in the example. It's fixed now. – brandonscript Jan 29 '14 at 17:50
  • I was referring to this line: `img src='$_SERVER[DOCUMENT_ROOT]$1'` - you have no quotes around `DOCUMENT_ROOT`. – IMSoP Jan 29 '14 at 17:51
  • Also, while the regex seems sound, it's not very clear in this answer how you actually use it with PHP, since none of the lines of code you show is valid PHP on its own. – IMSoP Jan 29 '14 at 17:52
  • @IMSoP inside a string that's perfectly valid. Or it could have `{}` if you really wanted it to. As for the php implementation, there are a zillion SO questions and instructions on using preg_replace. – brandonscript Jan 29 '14 at 17:56
  • OK, [apparently you're right](http://3v4l.org/IkEYS). That's really horrible behaviour, IMHO, since that looks just like an undefined constant to me. :( – IMSoP Jan 29 '14 at 18:03
  • @IMSoP not necessarily -- it works inside of a string like that and saves the trouble of matching up and escaping " and '. So long as you're consistent, it's OK. – brandonscript Jan 29 '14 at 18:03
  • Yeah, it just seems awkward to me that embedding it inside the string changes the behaviour like that. And adding braces around it makes it behave like it would outside the string again: http://3v4l.org/vb2q2 – IMSoP Jan 29 '14 at 18:07
  • @IMSoP hm, yeah - intriguing breakdown. Having never actually found a need to redefine a built-in variable, I've never experienced the phenomenon ;) When using and referencing arrays, I always use squiggly {}. – brandonscript Jan 29 '14 at 18:11
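
The interpolation behaviour discussed in these comments can be checked with a short sketch. `$server` is a hypothetical stand-in for `$_SERVER`, used only so that the snippet prints the same thing wherever it is run.

```php
<?php
// Hypothetical stand-in for $_SERVER so the sketch runs anywhere (e.g. on the CLI).
$server = ['DOCUMENT_ROOT' => '/var/www'];

// Simple syntax: inside a double-quoted string, an unquoted key is read as a string key.
$simple = "img src='$server[DOCUMENT_ROOT]/$1'";

// Curly syntax: normal expression rules apply, so the key must be quoted.
// An unquoted DOCUMENT_ROOT here would be looked up as a constant instead.
$braced = "img src='{$server['DOCUMENT_ROOT']}/$1'";

var_dump($simple === $braced); // bool(true)
echo $simple, "\n";            // img src='/var/www/$1'
```

The `$1` survives as literal text because a PHP variable name cannot start with a digit, so it is still available as a backreference when the string is passed to `preg_replace`.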
1

A lot of people state the importance of using a DOM parser, but too few answers actually demonstrate how to execute the task.

Regex, however tempting for a one-liner or a single-character change, is unsuitable for parsing HTML because it is DOM-ignorant -- it treats your input as a string and nothing more. I've crafted a demonstration of how the regex from the accepted answer will make unintended replacements.

Code: (Demo)

$html = <<<HTML
<p>Some random text <img src="anything.jpg"> text <iframe data-whoops="<img" src="anything.jpg"></iframe></p>
HTML;

define('DOC_ROOT', 'www.example.com/');

echo "With regex:\n";
echo preg_replace('/<img([^>]*)src=["\']([^"\'\\/][^"\']*)["\']/', '<img\1src="'.DOC_ROOT.'\2"', $html);

echo "\n\n---\n\nWith a parser:\n";

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach ($dom->getElementsByTagName('img') as $img) {
    $img->setAttribute('src', DOC_ROOT . $img->getAttribute('src'));
}
echo $dom->saveHTML();

Output:

With regex:
<p>Some random text <img src="www.example.com/anything.jpg"> text <iframe data-whoops="<img" src="www.example.com/anything.jpg"></iframe></p>

---

With a parser:
<p>Some random text <img src="www.example.com/anything.jpg"> text <iframe data-whoops="&lt;img" src="anything.jpg"></iframe></p>

If you need to make conditional replacements on an img tag's URL, additional tools such as a URL parser or XPath can be used to meet your requirements.
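
As one possible illustration of such a conditional replacement, the sketch below uses `DOMXPath` to select only the `img` elements whose `src` looks relative. The function name `prefix_relative_img_src`, the sample URLs, and the rule itself (no `://` and no leading slash) are assumptions chosen for the example, not requirements from the question.

```php
<?php
// Same illustrative value as in the demonstration above; an assumption, not a requirement.
defined('DOC_ROOT') || define('DOC_ROOT', 'www.example.com/');

// Hypothetical helper: prefixes only img src values that look relative.
function prefix_relative_img_src(string $html): string
{
    $dom = new DOMDocument;
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $xpath = new DOMXPath($dom);
    // Only <img> elements whose src has no scheme ("://") and does not start with "/".
    $query = '//img[@src and not(contains(@src, "://")) and not(starts-with(@src, "/"))]';
    foreach ($xpath->query($query) as $img) {
        $img->setAttribute('src', DOC_ROOT . $img->getAttribute('src'));
    }

    return $dom->saveHTML();
}

echo prefix_relative_img_src(
    '<p><img src="anything.jpg"> <img src="https://cdn.example.org/logo.png"> <img src="/static/a.png"></p>'
);
```

Putting the condition in the XPath expression keeps the loop body identical to the unconditional version; a stricter rule could instead inspect each `src` with `parse_url()` inside the loop.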

https://stackoverflow.com/a/60263813/2943403

Ultimately, my advice is to forget about how many lines of code you write; just write robust/reliable code.

mickmackusa
-1

I think this is what you are looking for:

$pictureName = 'anything.jpg';

$html = str_replace($pictureName, DOC_ROOT.$pictureName, $html);
dincan
  • 1
    The "anything.jpg" was just an example. I want every image src to get DOC_ROOT in front of it. Thanks though :) – Jonan Jan 29 '14 at 17:41