While researching regex online, I found this wonderful PHP function:

function convert($input) {
   $pattern = '@(http(s)?://)?(([a-zA-Z0-9])([-\w]+\.)+([^\s\.]+[^\s]*)+[^,.\s])@';
   return $output = preg_replace($pattern, '<a href="http$2://$3">$0</a>', $input);
}

It was posted here on SO by Gero Nikolov in this post.

It converts text links into clickable links, however many there are in a paragraph, and it works great.

However, and here's my question: when this code encounters an email address, it leaves the name before the email's domain, and also the @, out of the link, so in name@domain.com only domain.com gets linked. I'm guessing the return part of the function would have to be amended to include the email link, but I'm not certain how this would be integrated into the above function. Would a conditional statement be necessary? Regex is still new to me, so I'm not sure if this would be the right direction to pursue.

Then it gets weird when a link is at the end of a sentence and that sentence is wrapped in single or double quotes (as in a press release or article). What happens there is that the end punctuation (period, exclamation, question mark, etc.) is included in the link along with the closing quote, like this: "Make sure to check out their website at domain.com." If the period is removed, the closing quote is still included in the link: "Make sure to check out their website at domain.com"

Things get even weirder when the last word of a regular sentence (not including any link text) lies within single or double quotes. For example, the last word with its punctuation and the closing quote are both turned into a link as such: "This is the first sentence. This is the second sentence. And this is the last sentence." If the closing punctuation (period, etc.) is removed, no link is created and the closing quote is unaffected: "This is the first sentence. This is the second sentence. And this is the last sentence"

So again, the question is, how should this function be modified to handle email addresses and also to stop adding ending punctuation and closing quotes if a link is at the end of a quoted sentence? Additionally, how can it be modified to prevent turning the last word with its ending punctuation into a link when it is in a quoted sentence?

I'm still new to regex work, so any help, links, or nudges in the right direction are appreciated. The function does one further thing: it makes links out of phone numbers separated by dots (555.555.5555 gets turned into a link), but I think that problem can wait for now. Again, I'm not asking for a rewritten code snippet, but some direction would be greatly appreciated, as I have no idea where to begin modifying this. Thanks!

wordman
  • That is a regex for url. Since the device part is optional, it's going to pick up all the junk it finds. –  Dec 19 '18 at 21:00
  • @sln Yes, you are right. Now how do we clean it up and how do we get it to recognize email addresses? – wordman Dec 19 '18 at 21:36
  • We spend about 6 hours writing a 500 character regex. I charge money for that. –  Dec 19 '18 at 22:10
  • @sln I understand. I did ask above for any helpful suggestions on where to learn more so I can modify this code. – wordman Dec 19 '18 at 22:17
  • The code is junk, don't use it. –  Dec 19 '18 at 22:21
  • Now that's something I can use, I appreciate that. I'll have to keep searching for something better. – wordman Dec 19 '18 at 22:22
  • The biggest problem is that it is searching for url's substrings in the middle of text. I'll give you a couple regex to help get you started, hang on. –  Dec 19 '18 at 22:23
  • URL: ^(?!mailto:)(?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?(?:(?:(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))|localhost)(?::\d{2,5})?(?:\/[^\s]*)?$ –  Dec 19 '18 at 22:26
  • That's a handful right there, thank you. I'll start working with it. Many thanks! – wordman Dec 19 '18 at 22:28
  • EMAIL: (?i)^(?:("[^"\\]*(?:\\.[^"\\]*)*"@)|((?:[0-9a-z](?:\.(?!\.)|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)?[0-9a-z]@))(?:(\[(?:\d{1,3}\.){3}\d{1,3}\])|((?:[0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9][\-a-z0-9]{0,22}[a-z0-9]))$ –  Dec 19 '18 at 22:29
  • EMAIL RFC5322: (?im)^(?=.{1,64}@)(?:("[^"\\]*(?:\\.[^"\\]*)*"@)|((?:[0-9a-z](?:\.(?!\.)|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)?[0-9a-z]@))(?=.{1,255}$)(?:(\[(?:\d{1,3}\.){3}\d{1,3}\])|((?:(?=.{1,63}\.)[0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9][\-a-z0-9]{0,22}[a-z0-9])|((?=.{1,63}$)[0-9a-z][-\w]*))$ –  Dec 19 '18 at 22:29
  • See https://regex101.com/r/ObS3QZ/1 for RFC5322 email. –  Dec 19 '18 at 22:34
  • Very kind, thank you! – wordman Dec 19 '18 at 22:34
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/185517/discussion-between-wordman-and-sln). – wordman Dec 19 '18 at 22:48
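
One possible direction for the question above, as a rough sketch rather than a drop-in fix: match email addresses and URLs in a single pattern, with the email alternative tried first, and build the link in a callback so sentence punctuation can be kept outside the anchor. The function name `convert_links`, the simplified email and URL patterns, and the rule that the last domain label must be letters only are all assumptions made for illustration; none of them come from the original function or from the regexes in the comments. The sketch also assumes the input is plain text, not HTML.

```php
<?php
// Hypothetical helper, not part of the original post: links email addresses
// and URLs in plain text in a single pass.
function convert_links(string $input): string
{
    // The lookbehind stops a match from starting inside a word or right
    // after "@" or ".". The email alternative comes first so that the
    // domain of an address is never matched as a bare URL.
    $pattern = '~
        (?<![\w@.])
        (?:
            (?<email>[\w.%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,})
          |
            (?<url>(?:https?://)?(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}(?:/[^\s<>"\']*)?)
        )
    ~xi';

    $result = preg_replace_callback($pattern, function (array $m): string {
        if (!empty($m['email'])) {
            $email = htmlspecialchars($m['email'], ENT_QUOTES);
            return '<a href="mailto:' . $email . '">' . $email . '</a>';
        }

        // A path may swallow sentence punctuation; move it back outside the link.
        $url  = rtrim($m['url'], '.,;:!?');
        $tail = substr($m['url'], strlen($url));

        // Add a scheme only when the text did not include one.
        $href = preg_match('~^https?://~i', $url) ? $url : 'http://' . $url;

        return '<a href="' . htmlspecialchars($href, ENT_QUOTES) . '">'
             . htmlspecialchars($url, ENT_QUOTES) . '</a>' . $tail;
    }, $input);

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $input;
}

echo convert_links('"Write to name@domain.com or see domain.com."'), "\n";
```

With this sketch, name@domain.com becomes a `mailto:` link, a trailing period or closing quote stays outside the link, and 555.555.5555 is left alone because the last label must be letters. It will still link anything shaped like `word.word` (a filename such as notes.txt, for example), so checking the last label against a list of known TLDs would be a tighter rule.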

0 Answers