The easiest way is to create a blacklist of sites using an alternation
in combination with the (*SKIP)(*FAIL) backtracking verbs.
This way the engine moves past the offending URLs and cannot backtrack into them.
(?:<a(?=\s)(?=(?:[^>"']|"[^"]*"|'[^']*')*?\shref\s*=\s*(?:(['"])(?:(?!\1)[\S\s])*?(?:www\.test\.com|test\.com)(?:(?!\1)[\S\s])*?\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>.*?</a\s*>(*SKIP)(*FAIL)|<a(?=\s)(?=(?:[^>"']|"[^"]*"|'[^']*')*?\shref\s*=\s*(?:(['"])([\S\s]*?)\2))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>(.*?)</a\s*>)
https://regex101.com/r/hpwUr3/1
The stuff you want is:
- Group 3 = url
- Group 4 = content
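If your engine lacks (*SKIP)(*FAIL) (Python's built-in re module is one example), the same blacklist idea can be approximated by letting the offender branch match without capturing anything, then discarding those matches in code. The following is only a rough sketch under that assumption: the pattern is a simplified stand-in for the full regex above (it does not handle a `>` inside quoted attributes), and the function name `good_links` and the sample HTML are made up for illustration.

```python
import re

# Simplified two-branch pattern. Group numbers mirror the regex above:
# (1) and (2) are the href quotes, (3) is the url, (4) is the content.
ANCHOR = re.compile(r"""
    # Offender branch: href contains a blacklisted host, groups 3/4 stay unset
    <a\s[^>]*?\bhref\s*=\s*(['"])        # (1) opening quote
    (?:(?!\1).)*?
    (?:www\.test\.com|test\.com)         # add more offenders here
    (?:(?!\1).)*?\1
    [^>]*>.*?</a\s*>
  |
    # Good branch
    <a\s[^>]*?\bhref\s*=\s*(['"])        # (2) opening quote
    (.*?)\2                              # (3) url
    [^>]*>
    (.*?)                                # (4) content
    </a\s*>
""", re.VERBOSE)


def good_links(html):
    """Return (url, content) pairs for anchors whose href is not blacklisted."""
    return [
        (m.group(3), m.group(4))
        for m in ANCHOR.finditer(html)
        if m.group(3) is not None  # None means the offender branch matched
    ]


# Hypothetical sample input
html = (
    '<a href="http://www.test.com/x">bad</a> '
    "<a class=\"k\" href='https://example.org/a'>good</a> "
    '<a href="http://test.com">bad2</a>'
)
print(good_links(html))  # [('https://example.org/a', 'good')]
```

Because the offender branch is tried first and consumes the whole blacklisted tag, the scan resumes after it, which is roughly the effect (*SKIP)(*FAIL) gives you in PCRE.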
Explained
(?:
# Begin Offender Anchor tag
< a
(?= \s )
(?= # Assertion for: href (a pseudo atomic group)
(?: [^>"'] | " [^"]* " | ' [^']* ' )*?
\s href \s* = \s*
(?:
( ['"] ) # (1)
(?:
(?! \1 )
[\S\s]
)*?
(?: # Add more offenders here
www \. test \. com
| test \. com
)
(?:
(?! \1 )
[\S\s]
)*?
\1
)
)
# Have the href offender, just match the rest of the tag
\s+
(?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
> # End tag
.*?
</a \s* >
(*SKIP) (*FAIL) # Move past the offender
|
# Begin Good Anchor tag
< a
(?= \s )
(?= # Assertion for: href (a pseudo atomic group)
(?: [^>"'] | " [^"]* " | ' [^']* ' )*?
\s href \s* = \s*
(?:
( ['"] ) # (2)
( [\S\s]*? ) # (3), Good link
\2
)
)
# Have the good href, just match the rest of the tag
\s+
(?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
> # End tag
( .*? ) # (4), Content
</a \s* >
)