I've got a string like this:

Google is a <a href="http://hi.hi?xxx&yyy&zzz">web&amp;search engine</a>.

I want to replace & with &amp; only within links, as required by the W3C validator:

Google is a <a href="http://hi.hi?xxx&amp;yyy&amp;zzz">web&amp;search engine</a>.

Could you suggest a regexp for that? Thanks!

Dmitry Isaev

3 Answers

The official correct answer is that you should not use a regex to parse HTML. Instead, take a look at HTML-parsing libraries. This question covers your options:

How do you parse and process HTML/XML in PHP?

I suggest taking this approach. Once you use a tool like DOM to parse the HTML, you can use a simple regex to perform your replacement within the links. People will be glad to help if you have trouble.
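
As a rough illustration of the parser route, here is a minimal sketch using PHP's bundled DOMDocument. The helper name `reserializeFragment` and the wrapper `<div>` are assumptions made for this example, not something from the linked question. The idea is that the parser reads the sloppy markup and the serializer writes attribute values back out with `&` escaped, so no regex is involved at all. Non-ASCII input may additionally need an explicit encoding hint, which this sketch leaves out.

```php
<?php
// Hypothetical helper, not part of the original answer.
function reserializeFragment(string $html): string
{
    // A bare "&" is a parse error for libxml, so collect warnings quietly.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // The wrapper <div> gives the fragment a single container element.
    $doc->loadHTML('<div>' . $html . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serialize only the wrapper's children, not the implied <html>/<body>.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$input = 'Google is a <a href="http://hi.hi?xxx&yyy&zzz">web&amp;search engine</a>.';
echo reserializeFragment($input), PHP_EOL;
```

Note that this re-serializes the whole fragment, so other parts of the markup may be normalized as well; check the output against your own data before relying on it.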

If you do insist on using a regex for this (and it can be OK in limited cases where the HTML content is under your control), just search this site and you will find plenty of questions in which people show how to do it.

dan1111

As dan1111 noted, regexes are a brittle tool for this at best. The next problem is that you would need variable-length lookbehind assertions to reach a level of reliability I would be comfortable with.

That said, the following may well work well enough for you. Try it on data that you have backed up first:

$result = preg_replace('/&(?=[^<>]*>)/', '&amp;', $subject);

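This replaces an & only if the next angle bracket after it is a closing one (>), in other words only if the & sits inside a tag. Note that this covers every tag, not just links.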
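
To see the pattern above in action, here is a small runnable sketch on the string from the question. It also shows one caveat: an `&amp;` that is already inside a tag gets encoded a second time. The `$strict` pattern is an assumption added for this example (not part of the answer); it skips any `&` that already starts something shaped like an entity.

```php
<?php
$subject = 'Google is a <a href="http://hi.hi?xxx&yyy&zzz">web&amp;search engine</a>.';

// Pattern from the answer: "&" followed by ">" with no "<" or ">" in between.
$result = preg_replace('/&(?=[^<>]*>)/', '&amp;', $subject);
echo $result, PHP_EOL;

// Caveat: an href that is already escaped gets double-encoded.
$already = '<a href="?a=1&amp;b=2&c=3">x</a>';
$doubled = preg_replace('/&(?=[^<>]*>)/', '&amp;', $already);
echo $doubled, PHP_EOL;

// Hypothetical stricter variant: leave "&" alone when it already starts
// a named, decimal, or hexadecimal entity such as &amp; &#38; or &#x26;.
$strict = '/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#x[0-9a-fA-F]+);)(?=[^<>]*>)/';
$fixed = preg_replace($strict, '&amp;', $already);
echo $fixed, PHP_EOL;
```

The stricter variant only checks the shape of the entity, not whether the name is a real one, so a query parameter that happens to be followed by a semicolon would be left unescaped.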

Tim Pietzcker

You can use a lookahead and a lookbehind.

&(?<=\<a\s(href).*)(?=.*\"\>)

This looks for every & that is preceded by <a href and any further characters, and followed by any characters and then ">. When I tested it on RegexHero, it selected only the & characters within the link itself.
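
One thing to watch if you are in PHP: RegexHero runs on the .NET engine, which accepts a lookbehind containing `.*`, whereas PCRE (behind `preg_replace`) rejects unbounded lookbehind, so the pattern above will not compile there. A rough PHP equivalent of the same idea, restricting the replacement to the `href` of `<a>` tags, is sketched below. The function name `escapeAmpersandsInHrefs` is made up for this example, and the sketch assumes double-quoted attribute values.

```php
<?php
// Hypothetical helper: escape bare "&" inside double-quoted href values of <a> tags.
function escapeAmpersandsInHrefs(string $html): string
{
    return preg_replace_callback(
        // Group 1: everything from "<a" up to and including href="
        // Group 2: the attribute value itself
        '/(<a\b[^>]*?\bhref=")([^"]*)"/i',
        function (array $m): string {
            // double_encode = false keeps existing entities such as &amp; intact.
            $url = htmlspecialchars($m[2], ENT_COMPAT, 'UTF-8', false);
            return $m[1] . $url . '"';
        },
        $html
    );
}

$input = 'Google is a <a href="http://hi.hi?xxx&yyy&zzz">web&amp;search engine</a>.';
echo escapeAmpersandsInHrefs($input), PHP_EOL;
```

Because the callback only sees the attribute value, text outside the `href` is never touched, which sidesteps the need for a lookbehind altogether.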

Nick