
I'm looking to create a regex that matches links without a trailing dot. I know an FQDN technically ends with the root dot, but I'm working on a blog service. I need to process blog posts, and apparently some users end a post with a link followed by a dot that finishes their sentence.

Those texts look something like:

Example text... https://example.com/site. More text here...

The problem is that the trailing dot becomes part of the link, so the link doesn't point to any webpage. With the help of this question I made this PHP function:

function modifyText($text) {
    $url = '/(http|https)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/';
    $string= preg_replace($url, '<a href="$0" target="_blank">$0</a>', $text);
    return $string;
}

With the example from above this code generates

Example text... <a href="https://example.com/site." target="_blank">https://example.com/site.</a> More text here...

but it should generate

Example text... <a href="https://example.com/site" target="_blank">https://example.com/site</a>. More text here...

Ian Fako

2 Answers


One option is to lazily repeat non-space characters at the end, then look ahead for zero or more dots followed by whitespace or the end of the string:

'/https?:\/\/[a-z0-9.-]+\.[a-z]{2,3}(\/\S*?(?=\.*(?:\s|$)))?/i'

https://regex101.com/r/4VEWjW/2
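
To show how this pattern behaves inside the function from the question, here is a minimal sketch. The function name `linkifyLazy` is hypothetical and only used for illustration; the pattern is the one above, unchanged.

```php
<?php
// Hypothetical helper: the question's function with the lazy + lookahead pattern.
function linkifyLazy($text) {
    // \S*? grows one character at a time until the lookahead sees
    // optional dots followed by whitespace or the end of the string.
    $url = '/https?:\/\/[a-z0-9.-]+\.[a-z]{2,3}(\/\S*?(?=\.*(?:\s|$)))?/i';
    return preg_replace($url, '<a href="$0" target="_blank">$0</a>', $text);
}

echo linkifyLazy("Example text... https://example.com/site. More text here...");
```

With the sample text from the question, the trailing dot should end up outside the anchor tag.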

You could also repeat dots followed by non-dots, which avoids the lazy quantifier:

'/https?:\/\/[a-z0-9.-]+\.[a-z]{2,3}(\/\.*[^.]+(?=\.*(?:\s|$)))?/i'
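
One thing to watch with this pattern: `[^.]` also matches whitespace, so a path followed by plain words can be swallowed up to the next dot, and a path with an inner dot such as `/index.html` falls back to matching only the host. The sketch below is a variant, not the pattern above: it excludes whitespace and only consumes a run of dots when more path characters follow. The function name `linkifyDotRuns` is hypothetical.

```php
<?php
// Hypothetical variant of the "dots followed by non-dots" idea.
function linkifyDotRuns($text) {
    // /[^.\s]*            first path segment without dots or whitespace
    // (?:\.+[^.\s]+)*     a dot run is only taken if more path text follows,
    //                     so trailing dots are never part of the match
    $url = '~https?://[a-z0-9.-]+\.[a-z]{2,3}(/[^.\s]*(?:\.+[^.\s]+)*)?~i';
    return preg_replace($url, '<a href="$0" target="_blank">$0</a>', $text);
}

echo linkifyDotRuns("See https://example.com/index.html.");
```

Because no lookahead is needed here, the match simply stops before any dots that are not followed by further path characters.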
CertainPerformance
  • This solves this problem, but I've encountered a new one. TLDs can have more than 2 or 3 characters (I've set the limit to 9 for now), and URLs like `(https://example.longtld)` match the bracket too – Ian Fako Apr 26 '19 at 10:01
  • Changing it to 9 indeed looks to work fine: https://regex101.com/r/4VEWjW/3 – CertainPerformance Apr 26 '19 at 10:02
  • Similar to the second pattern, use a character set to match `.)`s, and then a negative character set to match anything but `.)\s`s at the end: https://regex101.com/r/4VEWjW/5 – CertainPerformance Apr 26 '19 at 10:09
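
The suggestion in the comments can be sketched roughly as follows. This is not a copy of the linked regex101 patterns, which are not reproduced here; it is an assumption of what such a pattern might look like, with the TLD limit raised to 9 as mentioned above. The function name `linkifyNoDotParen` is hypothetical.

```php
<?php
// Hypothetical sketch: keep trailing dots and closing parentheses out of the link.
function linkifyNoDotParen($text) {
    // [a-z]{2,9}             allow longer TLDs (limit of 9 taken from the comment)
    // [^\s.)]*               path characters other than whitespace, "." and ")"
    // (?:[.)]+[^\s.)]+)*     "." or ")" only count if more path text follows
    $url = '~https?://[a-z0-9.-]+\.[a-z]{2,9}(?:/[^\s.)]*(?:[.)]+[^\s.)]+)*)?~i';
    return preg_replace($url, '<a href="$0" target="_blank">$0</a>', $text);
}

echo linkifyNoDotParen("(https://example.longtld)");
```

The trade-off is that a URL which legitimately ends in a closing parenthesis loses that character.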

Another option is to use a negative lookbehind (?<!\.) after \S* to assert that the character to the left is not a dot:

https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,3}(?:\/\S*(?<!\.))?


If you don't need the capturing groups (), you can turn them into non-capturing groups (?:).

You don't have to escape the forward slash as \/ if you use a delimiter other than /, for example ~.

For example:

function modifyText($text) {
    // With ~ as the delimiter, forward slashes need no escaping.
    $url = '~https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,3}(?:/\S*(?<!\.))?~';
    return preg_replace($url, '<a href="$0" target="_blank">$0</a>', $text);
}

echo modifyText("Example text... https://example.com/site. More text here... https://example.com/site");

Result

Example text... <a href="https://example.com/site" target="_blank">https://example.com/site</a>. More text here... <a href="https://example.com/site" target="_blank">https://example.com/site</a>
The fourth bird
  • Works for the problem, but I've encountered another problem here, see discussion under the other answer, here is the example https://regex101.com/r/HxyZWR/2 – Ian Fako Apr 26 '19 at 10:10
  • You could create a character class `[.)]` to check if what is at the end is not a dot or a closing parenthesis https://regex101.com/r/HxyZWR/3 – The fourth bird Apr 26 '19 at 10:12
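
A minimal sketch of the character-class lookbehind from the last comment might look like this. The linked regex101 pattern is not reproduced here, so treat the exact pattern as an assumption; the TLD limit of 9 comes from the discussion under the other answer, and the function name `linkifyLookbehind` is hypothetical.

```php
<?php
// Hypothetical sketch: lookbehind with a character class instead of a single dot.
function linkifyLookbehind($text) {
    // \S* is greedy, then backtracks until the last matched character
    // is neither "." nor ")" because of the negative lookbehind (?<![.)]).
    $url = '~https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,9}(?:/\S*(?<![.)]))?~';
    return preg_replace($url, '<a href="$0" target="_blank">$0</a>', $text);
}

echo linkifyLookbehind("(https://example.com/site). Next");
```

Dots inside the path, as in `/a.b`, are still matched, since the lookbehind only inspects the final character of the match.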