Edit: None of the supplied links address why negative lookbehinds are not behaving as I would expect, given the documentation I've read: https://www.regular-expressions.info/lookaround.html, https://regexr.com/, https://www.pcre.org/original/doc/html/pcrepattern.html#lookbehind

I am trying to create a regex that pulls naked URLs out of a body of text while ignoring those inside an <a> tag. The URL matching itself works, but when I try to exclude results with a negative lookbehind, it does not behave as I expect. As a check, the second pattern below uses a positive lookbehind and shows that href=" or href=' is found directly before the URL; converting that positive lookbehind to a negative lookbehind, however, does not exclude those results. I am sure it is something I am doing wrong, and any help or feedback is greatly appreciated.

For testing, I am using the following (note the difference in the global g flag): https://regexr.com with /pattern/gi, and PHP with /pattern/i.
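On the flag difference: PCRE in PHP has no g modifier, and whether one or all matches are processed depends on the function that is called. The sketch below illustrates this with a deliberately simplified pattern and made-up sample text (both are assumptions for illustration, not the full URL regex from this question).

```php
<?php
// Simplified, hypothetical pattern: a bare domain such as "google.com".
$pattern = '/[a-z0-9.\-]+[.][a-z]{2,4}/i';
$text = 'google.com and example.org';

// preg_match() stops after the first match.
preg_match($pattern, $text, $first);

// preg_match_all() collects every match, like the g flag on regexr.com.
preg_match_all($pattern, $text, $all);

// preg_replace() replaces every match by default (its limit defaults to -1).
$replaced = preg_replace($pattern, '[$0]', $text);

echo $first[0], "\n";            // google.com
echo implode(', ', $all[0]), "\n"; // google.com, example.org
echo $replaced, "\n";            // [google.com] and [example.org]
```

So a /pattern/i passed to preg_replace() should already behave like /pattern/gi on regexr.com as far as the number of replacements is concerned.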

Match full URL (works): ((?:\bhttps?:\/\/){0,1}((?:[a-z0-9.\-]+[.][a-z]{2,4}))(?:[^\s()<>{}\[\]]*|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

Match full URL only for URLs inside an <a> tag (works): ((?<=(\bhref=("|')))(?:\bhttps?:\/\/){0,1}((?:[a-z0-9.\-]+[.][a-z]{2,4}))(?:[^\s()<>{}\[\]]*|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

Match full URL only for URLs not inside an <a> tag (does not work): ((?<!(\bhref=("|')))(?:\bhttps?:\/\/){0,1}((?:[a-z0-9.\-]+[.][a-z]{2,4}))(?:[^\s()<>{}\[\]]*|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
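A possible explanation for the behaviour, sketched with a deliberately simplified stand-in for the URL pattern (the variable names, the shortened pattern, and the sample text are assumptions for illustration): a negative lookbehind only rejects the one position directly after href=" or href='. The engine then retries one character later, where the lookbehind passes, so the match is shifted (oogle.com) rather than dropped. One commonly used PCRE approach, shown second, is to match whole <a ...>...</a> elements first and discard them with the (*SKIP)(*FAIL) backtracking verbs.

```php
<?php
// Simplified, hypothetical stand-in for the full URL pattern.
$url = '(?:https?:\/\/)?[a-z0-9.\-]+[.][a-z]{2,4}';

$text = "Ignore: <a href=\"google.com\">Google</a>\nConvert: google.com";

// 1) Negative lookbehind only: the position right after href=" is rejected,
//    but the next position (the first "o") is accepted.
preg_match_all('/(?<!href=["\'])' . $url . '/i', $text, $with_lookbehind);

// 2) Consume complete <a> elements and force a failure there, so the
//    engine resumes searching after the closing </a>.
preg_match_all(
    '/<a\b[^>]*>.*?<\/a>(*SKIP)(*FAIL)|' . $url . '/is',
    $text,
    $with_skip
);

echo implode(', ', $with_lookbehind[0]), "\n"; // oogle.com, google.com
echo implode(', ', $with_skip[0]), "\n";       // google.com
```

The second pattern is only a sketch: it assumes anchors are not nested and that every <a> element is closed. For arbitrary HTML, a DOM parser may be the more robust choice.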

Body of text being searched through:

Ignore: <a href='google.com'>Google</a>
Convert: https://google.com
Ignore: <a href="https://google.com">Google</a>
Convert: google.com?argu=ment
Ignore: <a href="google.com?argu=ment">Google</a>
Convert: google.com/path/to/script.php
Ignore: <a href='google.com/path/to/script.php'>Google</a>
Convert: https://google.com/path/to/script.php
Ignore: <a href='https://google.com/path/to/script.php'>Google</a>
Convert: google.com/path/to/script.php?argu=ment&and=more
Ignore: <a href="google.com/path/to/script.php?argu=ment&and=more">Google</a>
Convert: https://google.com/path/to/script.php?argu=ment&and=more
Ignore: <a href="https://google.com/path/to/script.php?argu=ment&and=more">Google</a>
Convert: google.com/path/to/script.php?argu=ment&and=more#hash
Ignore: <a href="google.com/path/to/script.php?argu=ment&and=more#hash">Google</a>
Convert: https://google.com/path/to/script.php?argu=ment&and=more#hash
Ignore: <a href="https://google.com/path/to/script.php?argu=ment&and=more#hash">Google</a>
Convert: www.google.com
Ignore: <a href="www.google.com">Google</a>
Convert: https://www.google.com
Ignore: <a href="https://www.google.com">Google</a>
Convert: www.google.com?argu=ment
Ignore: <a href="www.google.com?argu=ment">Google</a>
Convert: www.google.com/path/to/script.php
Ignore: <a href="www.google.com/path/to/script.php">Google</a>
Convert: https://www.google.com/path/to/script.php
Ignore: <a href="https://www.google.com/path/to/script.php">Google</a>
Convert: www.google.com/path/to/script.php?argu=ment&and=more
Ignore: <a href="www.google.com/path/to/script.php?argu=ment&and=more">Google</a>
Convert: https://www.google.com/path/to/script.php?argu=ment&and=more
Ignore: <a href="https://www.google.com/path/to/script.php?argu=ment&and=more">Google</a>
Convert: www.google.com/path/to/script.php?argu=ment&and=more#hash
Ignore: <a href="www.google.com/path/to/script.php?argu=ment&and=more#hash">Google</a>
Convert: https://www.google.com/path/to/script.php?argu=ment&and=more#hash
Ignore: <a href="https://www.google.com/path/to/script.php?argu=ment&and=more#hash">Google</a>

PHP from the test script I am using (WARNING: completely unfiltered inputs; do not share a public URL with this code):
<?php

$input_default = <<<INPUT_DEFAULT
Convert: google.com
Ignore: <a href='google.com'>Google</a>
Convert: https://google.com
Ignore: <a href="https://google.com">Google</a>
Convert: google.com?argu=ment
Ignore: <a href="google.com?argu=ment">Google</a>
Convert: google.com/path/to/script.php
Ignore: <a href='google.com/path/to/script.php'>Google</a>
Convert: https://google.com/path/to/script.php
Ignore: <a href='https://google.com/path/to/script.php'>Google</a>
Convert: google.com/path/to/script.php?argu=ment&and=more
Ignore: <a href="google.com/path/to/script.php?argu=ment&and=more">Google</a>
Convert: https://google.com/path/to/script.php?argu=ment&and=more
Ignore: <a href="https://google.com/path/to/script.php?argu=ment&and=more">Google</a>
Convert: google.com/path/to/script.php?argu=ment&and=more#hash
Ignore: <a href="google.com/path/to/script.php?argu=ment&and=more#hash">Google</a>
Convert: https://google.com/path/to/script.php?argu=ment&and=more#hash
Ignore: <a href="https://google.com/path/to/script.php?argu=ment&and=more#hash">Google</a>
Convert: www.google.com
Ignore: <a href="www.google.com">Google</a>
Convert: https://www.google.com
Ignore: <a href="https://www.google.com">Google</a>
Convert: www.google.com?argu=ment
Ignore: <a href="www.google.com?argu=ment">Google</a>
Convert: www.google.com/path/to/script.php
Ignore: <a href="www.google.com/path/to/script.php">Google</a>
Convert: https://www.google.com/path/to/script.php
Ignore: <a href="https://www.google.com/path/to/script.php">Google</a>
Convert: www.google.com/path/to/script.php?argu=ment&and=more
Ignore: <a href="www.google.com/path/to/script.php?argu=ment&and=more">Google</a>
Convert: https://www.google.com/path/to/script.php?argu=ment&and=more
Ignore: <a href="https://www.google.com/path/to/script.php?argu=ment&and=more">Google</a>
Convert: www.google.com/path/to/script.php?argu=ment&and=more#hash
Ignore: <a href="www.google.com/path/to/script.php?argu=ment&and=more#hash">Google</a>
Convert: https://www.google.com/path/to/script.php?argu=ment&and=more#hash
Ignore: <a href="https://www.google.com/path/to/script.php?argu=ment&and=more#hash">Google</a>
INPUT_DEFAULT;
$regex_default = <<<REGEX_DEFAULT
((?:https?:\/\/){0,1}((?:[a-z0-9.\-]+[.][a-z]{2,4}))(?:[^\s()<>{}\[\]]*|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
REGEX_DEFAULT;
$replace_with_default = '<a target="__" href="$1">$1</a>';
$input = (!empty($_POST['input']) ? $_POST['input'] : $input_default);
$regex = (!empty($_POST['regex']) ? $_POST['regex'] : $regex_default);
$replace_with = (!empty($_POST['replace_with']) ? $_POST['replace_with'] : $replace_with_default);

?>
<form method="POST" action="<?php echo $_SERVER['SCRIPT_NAME']; ?>">
input:<br />
<textarea name="input" style="width: 100%;" rows="10"><?php echo($input); ?></textarea>
<br />
regex:
<table width="100%"><tr>
<td>/</td><td width="100%"><textarea name="regex" style="width: 100%;" rows="1"><?php echo($regex); ?></textarea></td><td>/i</td>
</tr></table>
replace with:<br />
<textarea name="replace_with" style="width: 100%;" rows="1"><?php echo($replace_with); ?></textarea>
<input type="submit" value="Convert »" />
</form>
<?php
if(!empty($_POST)) {
    $replaced_text = preg_replace('/'.$regex.'/i', $replace_with, $input);
    echo('<hr />converted:<br /><textarea style="width: 100%;" rows="20">'.$replaced_text.'</textarea><hr />'.nl2br($replaced_text).'<hr />');
}
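Regarding the warning above about unfiltered inputs: one way the echoed values could be escaped is sketched below. The helper name escape_for_html is hypothetical and not part of the test script; htmlspecialchars() is the standard PHP function it wraps.

```php
<?php
// Hypothetical helper, not part of the original test script.
function escape_for_html(string $value): string
{
    // ENT_QUOTES also escapes single quotes, which matters inside attributes.
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

// Made-up input that would otherwise close the textarea early.
$untrusted = '</textarea><script>alert(1)</script>';

echo '<textarea>' . escape_for_html($untrusted) . '</textarea>', "\n";
```

This would apply to the values echoed into the textareas; the nl2br() preview in the script renders the generated links as HTML on purpose, so escaping it would change what the test page shows.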