Replace protocol in links that don't match a given domain

Question

I'm stuck on replacing just the protocol of links inside a text when the link does not point to a given domain.

Test case:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam <a title="mytitle" href="https://www.other-domain.de/path/index.html" target="_blank">other domain</a> nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd <a title="other title" href="https://www.my-domain.de/path/index.html" target="_blank">my domain</a>, no sea takimata <a title="mytitle" href="https://www.other-domain.de/path2/index2.html" target="_blank">other domain</a> est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed <a title="other title" href="https://www.my-domain.de/path/index.html" target="_blank">my domain</a> voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

Regex so far:

$content = preg_replace('/<a (.*?)href=[\"\'](.*?)\/\/(.*?)[\"\'](.*?)>(.*?)<\/a>/i', '<a href="http://$3">$5</a>', $content);

However, this matches all links. My goal is to apply the replacement only to links that do not match a given domain, e.g. "my-domain.de" in my case.

That is to say -- only links that don't match the given domain should have their protocol changed from "https" to "http".

Cheers Marek

**Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php or [this SO thread](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. — Andy Lester, Sep 19 '13 at 12:26
Hi Lester, I don't want to parse the content, I just need to replace some stuff. It's like getting the text, replacing, and putting the text back. — magic.77, Sep 19 '13 at 13:10
**You *are* asking to parse the HTML.** If you're trying to find parts of the HTML structure, then you're parsing. You may be trying to get by by faking it with regular expressions, but you're still parsing, and regular expressions are not up to the task. See here for why: http://htmlparsing.com/regexes — Andy Lester, Sep 19 '13 at 13:17
So then give me a better solution. By the way, this happens inside a WordPress module where I need to replace https with http inside a text and hand it back to WordPress. For me there is no need to build up additional PHP HTML parser modules. — magic.77, Sep 19 '13 at 13:30
I gave you links to two pages with better solutions in my original comment. — Andy Lester, Sep 19 '13 at 13:50
Nice, but it's a framework that I would have to include in WordPress and get running, just to replace some characters. Wow, that's too much for this. It may be a very good solution when used standalone. — magic.77, Sep 19 '13 at 14:13
"Just replacing some characters" greatly understates the complexity of what it is that you are trying to achieve. — Andy Lester, Sep 19 '13 at 14:27
Hi Lester, sure, you're right. Breaking the task down again gave me another solution without regex. I understand and support your post well now :-) — magic.77, Sep 20 '13 at 09:59
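
For reference, the parser-based route suggested in the comments above might look roughly like the following. This is only a minimal sketch, assuming PHP's bundled DOM extension (`DOMDocument`) is available; the function names `isOwnHost` and `downgradeForeignLinks` are hypothetical and are not taken from the linked pages or from the asker's eventual solution.

```php
<?php
// Sketch only: function names are hypothetical, and the DOM extension
// (ext-dom) is assumed to be installed.

// True if $host is $ownDomain itself or one of its subdomains.
function isOwnHost(string $host, string $ownDomain): bool
{
    $host = strtolower($host);
    $suffix = '.' . strtolower($ownDomain);
    return $host === strtolower($ownDomain)
        || substr($host, -strlen($suffix)) === $suffix;
}

function downgradeForeignLinks(string $html, string $ownDomain): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    // The XML declaration hints UTF-8 to libxml; the wrapper div lets us pull
    // the fragment back out without the html/body tags that loadHTML() adds.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        $scheme = parse_url($href, PHP_URL_SCHEME);
        $host = parse_url($href, PHP_URL_HOST);
        if (is_string($scheme) && is_string($host)
            && strtolower($scheme) === 'https'
            && !isOwnHost($host, $ownDomain)) {
            // Swap only the scheme; the rest of the URL is untouched.
            $a->setAttribute('href', 'http' . substr($href, strlen('https')));
        }
    }

    // Serialize the wrapper's children, i.e. the original fragment.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Calling `downgradeForeignLinks($content, 'my-domain.de')` would leave links to `my-domain.de` and its subdomains alone. Because the markup is parsed and re-serialized, whitespace and attribute quoting in the result may differ slightly from the input.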

score 0 · Answer 1 · edited May 23 '17 at 12:29

For what it's worth, this is the regex that you're looking for:

Raw Match Pattern:

<a ((?:(?!href).)*?)href=[\"\']https:\/\/((?:(?!my-domain.de).)*?)[\"\'](.*?)>(.*?)<\/a>

Raw Replace Pattern:

<a $1href="http://$2"$3>$4</a>

The PHP code is:

$content = preg_replace('/<a ((?:(?!href).)*?)href=[\"\']https:\/\/((?:(?!my-domain.de).)*?)[\"\'](.*?)>(.*?)<\/a>/i','<a $1href="http://$2"$3>$4</a>',$content);
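
One detail worth noting about the pattern above: the dot in `my-domain.de` is not escaped, so it matches any character, not only a literal dot. A possible variant, sketched here under the assumption that the domain arrives as a plain string, builds the pattern with `preg_quote()`. The function name `httpsToHttpExceptDomain` is made up for illustration.

```php
<?php
// Hypothetical wrapper around the pattern above; the function name and
// parameters are illustrative and not part of the original answer.
function httpsToHttpExceptDomain(string $content, string $domain): string
{
    // preg_quote() escapes regex metacharacters such as "." and, because of
    // the second argument, the "/" delimiter as well.
    $quoted = preg_quote($domain, '/');
    $pattern = '/<a ((?:(?!href).)*?)href=["\']https:\/\/((?:(?!'
             . $quoted
             . ').)*?)["\'](.*?)>(.*?)<\/a>/i';
    return preg_replace($pattern, '<a $1href="http://$2"$3>$4</a>', $content);
}

$content = '<a href="https://my-domain.de">my domain</a>' . "\n"
         . '<a href="https://other-domain.de">other domain</a>';
echo httpsToHttpExceptDomain($content, 'my-domain.de'), "\n";
```

The lookahead still checks for the domain anywhere inside the URL, so a foreign link such as `https://other-domain.de/?ref=my-domain.de` would be left untouched as well. All the caveats about regex-based HTML handling continue to apply.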

That being said, be forewarned: to Andy Lester's point, this regex is not reliable. In my opinion, though, the issue is not quite "the nature of HTML", or at least not only that. The point being made in this admittedly great resource, http://htmlparsing.com/regexes, is that you're attempting to reinvent the wheel on a very bumpy road. The broader concern is "not that regular expressions are evil, per se, but that overuse of regular expressions is evil." That quote is from Jeff Atwood's exceptional elaboration on the joy and terror of regular expressions, "Regular Expressions: Now You Have Two Problems". (He also has an article specifically warning against using regular expressions to parse HTML, "Parsing Html The Cthulhu Way".)

Specifically, in the case of my "solution" above, the following input (with line breaks inside the tag) will not be matched, despite being valid HTML:

<a title="mytitle"
href="https://www.other-domain.de/path/index.html" 
target="_blank">other domain</a>

The following inputs, however, are handled as desired:

<a href="https://my-domain.de">my domain</a>
<a href="https://other-domain.de">other domain</a>

<a href="https://www.my-domain.de/path/index.html">my domain</a>
<a href="https://www.other-domain.de/path/index.html">other domain</a>

<a title="other title" href="https://www.my-domain.de/path/index.html" target="_blank">other domain</a>
<a title="my title" href="https://www.other-domain.de/path/index.html" target="_blank">my domain</a>

They become:

<a href="https://my-domain.de">my domain</a>
<a href="http://other-domain.de">other domain</a>

<a href="https://www.my-domain.de/path/index.html">my domain</a>
<a href="http://www.other-domain.de/path/index.html">other domain</a>

<a title="other title" href="https://www.my-domain.de/path/index.html" target="_blank">other domain</a>
<a title="my title" href="http://www.other-domain.de/path/index.html" target="_blank">my domain</a>
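
The multi-line input shown earlier fails because, without further modifiers, `.` does not match a newline. As a rough sketch (not a recommendation), adding the `s` modifier (PCRE_DOTALL) to the same pattern lets the groups span line breaks:

```php
<?php
// Same pattern as in the answer, plus the "s" modifier (PCRE_DOTALL)
// so that "." also matches newline characters.
$pattern = '/<a ((?:(?!href).)*?)href=["\']https:\/\/((?:(?!my-domain.de).)*?)["\'](.*?)>(.*?)<\/a>/is';
$replacement = '<a $1href="http://$2"$3>$4</a>';

// The multi-line anchor from the failing example.
$input = "<a title=\"mytitle\"\n"
       . "href=\"https://www.other-domain.de/path/index.html\" \n"
       . "target=\"_blank\">other domain</a>";

$output = preg_replace($pattern, $replacement, $input);
echo $output, "\n";
```

The trade-off is that every lazy group may now run across lines, so unbalanced or unusual markup can make a match swallow more text than intended. That is one more instance of the fragility discussed above.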

A great resource for explaining the full breakdown of the regex is here: http://www.myregextester.com/index.php

To replicate the test on that tool:

1. Select the "replace" operation.
2. Put your regex into "match pattern".
3. Put the replacement into "replace pattern".
4. Select the "i" flag checkbox.
5. Select the "explain" checkbox.
6. Select the "PHP" checkbox.
7. Put your target content into "source text".
8. Click "Submit".

For convenience and posterity, I've included the full explanation provided by that tool below, but two of the conceptual highlights are:

- Lookaheads and negative lookaheads, e.g. `(?!text)`: http://php.net/manual/en/regexp.reference.assertions.php
- Non-capturing subpatterns, e.g. `(?:text)` or the outer part of `(?:(?!text))`: http://php.net/manual/en/regexp.reference.subpatterns.php
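
To see how the two constructs combine, here is a small standalone sketch. The word `stop` and the sample strings are arbitrary placeholders. The idiom `(?:(?!text).)*` consumes one character at a time, but only while `text` does not begin at the current position.

```php
<?php
// Whole-string test: succeeds only if "stop" occurs nowhere in the subject.
$noStop = '/^(?:(?!stop).)*$/';
var_dump(preg_match($noStop, 'go go go'));   // int(1)
var_dump(preg_match($noStop, 'go stop go')); // int(0)

// Capturing variant: group 1 holds everything before the first "stop".
preg_match('/^((?:(?!stop).)*)/', 'abcstopdef', $m);
var_dump($m[1]); // string(3) "abc"
```

In the answer's pattern the same idiom appears twice: once to keep the first group from running past `href`, and once to reject URLs that contain the protected domain.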

Match Pattern Explanation:

The regular expression:

`(?i-msx:<a ((?:(?!href).)*?)href=[\"\']https:\/\/((?:(?!my-domain.de).)*?)[\"\'](.*?)>(.*?)<\/a>)`

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?i-msx:                 group, but do not capture (case-insensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  <a                       '<a '
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
        href                     'href'
----------------------------------------------------------------------
      )                        end of look-ahead
----------------------------------------------------------------------
      .                        any character except \n
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  href=                    'href='
----------------------------------------------------------------------
  [\"\']                   any character of: '\"', '\''
----------------------------------------------------------------------
  https:                   'https:'
----------------------------------------------------------------------
  \/                       '/'
----------------------------------------------------------------------
  \/                       '/'
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
        my-domain                'my-domain'
----------------------------------------------------------------------
        .                        any character except \n
----------------------------------------------------------------------
        de                       'de'
----------------------------------------------------------------------
      )                        end of look-ahead
----------------------------------------------------------------------
      .                        any character except \n
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  [\"\']                   any character of: '\"', '\''
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------
  (                        group and capture to \4:
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \4
----------------------------------------------------------------------
  <                        '<'
----------------------------------------------------------------------
  \/                       '/'
----------------------------------------------------------------------
  a>                       'a>'
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

PS: A prettier regex tool can be found here, permalink to the solution plus original input sample -- http://regex101.com/r/xE8eP4 — DreadPirateShawn, Oct 11 '13 at 06:27
