I have a method I use to clean up the output from user-submitted data. I can pass options to allow or disallow URLs and emails independently. It worked fine until just now, when I used it with URLs disallowed and emails allowed. The problem is that the regex I use to block URLs also blocks the domain part of email addresses. How can I block URLs and domains, but only if they are not part of an email address?

My existing code:

// email address removal
if ( ! ISSET($options['email']) || $options['email'] === FALSE) {
    $pattern = "/[^@\s]*@[^@\s]*\.[^@\s]*/";
    $replacement = '<span class="muted">*</span>';
    $string = preg_replace($pattern, $replacement, $string);
}
// url - link removal
if ( ! ISSET($options['url']) || $options['url'] === FALSE) {
    $pattern = "/[a-zA-Z]*[:\/\/]*[A-Za-z0-9\-_]+\.+[A-Za-z0-9\.\/%&=\?\-_]+/i";
    $replacement = '<span class="muted">**</span>';
    $string = preg_replace($pattern, $replacement, $string);
}
Ally
  • Why are you not using a regex for checking if a string matches an email address? This would include filtering out all URLs. – Johannes Apr 14 '19 at 18:07
  • Please use a tried and true regex to match on URLs instead: https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url. Then your problem will disappear. – Vqf5mG96cSTT Apr 14 '19 at 18:41
  • @Johannes Thanks, can you give an example of such a regex? If so I'll accept your answer! – Ally Apr 14 '19 at 22:31

2 Answers

If you are working with PHP, a good way to check whether a string is a valid email address is filter_var() (see PHP filter_var). This function returns the filtered value, or FALSE if the filter fails (that is, the string is not a valid email address).

$filtered = filter_var($email_string, FILTER_VALIDATE_EMAIL);
if ($filtered !== false) {
  // valid email address
} else {
  // not a valid email address
}

There are some more filters available: https://www.php.net/manual/en/filter.filters.php

If you want to use a regex to validate your email address, you can take a look at this example: https://regex101.com/r/aG8fB6/2. It uses the following regex to validate email addresses:

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z]+

In PHP, you can use preg_match to check a string against a regex (PHP preg_match).
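As a minimal sketch of that check, the snippet below wraps preg_match in a helper. The function name looks_like_email is hypothetical, and two details are assumptions rather than part of the linked example: the dot before the top-level domain is escaped, and ^...$ anchors are added so that the whole string has to match. Like filter_var, this only tests a single string; it does not remove anything from a longer text.

```php
<?php
// Hypothetical helper: true if the whole string looks like a simple email address.
function looks_like_email(string $candidate): bool
{
    // Pattern from the answer, with two assumptions added here:
    // the dot before the TLD is escaped, and ^...$ force a full-string match.
    $pattern = '/^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z]+$/';

    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $candidate) === 1;
}

var_dump(looks_like_email('user@example.com'));    // bool(true)
var_dump(looks_like_email('https://example.com')); // bool(false)
```

Note that this simple pattern rejects addresses whose domain has more than one dot, so it is only an illustration of the preg_match call.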

Johannes
  • Thanks for your answer, but I don't think it will work in this scenario. I am not validating but removing unwanted content from a string. If a user submitted a string that contained text, multiple email addresses, URLs and domains, your solution would not remove the unwanted URLs and domains while leaving email addresses and text intact, I don't think. filter_var would only match true if the entire string was an email address. – Ally Apr 15 '19 at 12:52
  • For removing unwanted content (e.g. `;`) from an email address you can use the `FILTER_SANITIZE_EMAIL` option for `filter_var`. Is it not possible to give the user some feedback that the email address isn't valid and they need to change something? It is very hard to handle every possible case of wrong user input. – Johannes Apr 15 '19 at 13:32

What you could do is use negative lookaround assertions to verify that there is no non-whitespace character directly to the left, (?<!\S), or directly to the right, (?!\S), of the match.

A slightly updated version of your pattern could be:

(?<!\S)[a-zA-Z]*[:/]*[\w-]+\.+[\w:./%&=?-]+(?!\S)

Note that you don't have to escape the forward slash if you use a delimiter other than /, such as ~. The hyphen - can be moved to the start or the end of the character class so that it does not need escaping, and the dot . does not have to be escaped inside a character class.
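To show how this pattern could slot into the URL removal step from the question, here is a rough sketch. The function name mask_urls and the sample sentence are made up for illustration; the pattern is the one above with ~ as the delimiter, and the replacement markup is taken from the question.

```php
<?php
// Hypothetical helper that masks whitespace-delimited URLs and bare domains.
function mask_urls(string $string): string
{
    // (?<!\S) and (?!\S) require whitespace (or the string boundary) on both
    // sides, so the domain inside "me@example.com" cannot match on its own.
    $pattern = '~(?<!\S)[a-zA-Z]*[:/]*[\w-]+\.+[\w:./%&=?-]+(?!\S)~';
    $replacement = '<span class="muted">**</span>';

    return preg_replace($pattern, $replacement, $string);
}

echo mask_urls('Mail me@example.com or visit https://example.com/page?id=1 today');
// Mail me@example.com or visit <span class="muted">**</span> today
```

One trade-off of the whitespace boundaries: a URL directly followed by punctuation that is not in the character class, such as a comma, is not matched at all, so the boundaries may need loosening depending on the input you expect.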

The fourth bird
  • This looks like it could be the ticket although I found it matched 'and...' (or any word followed by more than one period) for some reason. I will try it out when I'm back at my desk later. – Ally Apr 15 '19 at 13:02
  • @Ally That is because of the dot in the character class. You might take it out and then repeat the dot followed by 1+ times what is in the character class. https://regex101.com/r/XAR3Rd/2 You can make the regex more specific if you want, it depends on what you would and would not allow to match. – The fourth bird Apr 15 '19 at 13:05
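The adjustment described in the last comment could look roughly like the sketch below. The exact pattern behind the regex101 link is not reproduced here, so this is one reading of the comment: the dot is taken out of the character class, and a group of "one dot followed by one or more class characters" is repeated instead. The function name mask_urls_strict is hypothetical.

```php
<?php
// Hypothetical variant: each dot must be followed by at least one
// non-dot character from the class, so "and..." is no longer matched.
function mask_urls_strict(string $string): string
{
    $pattern = '~(?<!\S)[a-zA-Z]*[:/]*[\w-]+(?:\.[\w:/%&=?-]+)+(?!\S)~';
    $replacement = '<span class="muted">**</span>';

    return preg_replace($pattern, $replacement, $string);
}

echo mask_urls_strict('wait and... see example.org');
// wait and... see <span class="muted">**</span>
```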