5

How do I remove all email addresses and links from a string and replace them with "[removed]"

rubo77
JEagle

7 Answers

25

You can use preg_replace to do it.

For emails:

$pattern = "/[^@\s]*@[^@\s]*\.[^@\s]*/";
$replacement = "[removed]";
$string = preg_replace($pattern, $replacement, $string); // preg_replace returns the modified string

For URLs:

$pattern = "/[a-zA-Z]*[:\/\/]*[A-Za-z0-9\-_]+\.+[A-Za-z0-9\.\/%&=\?\-_]+/i";
$replacement = "[removed]";
$string = preg_replace($pattern, $replacement, $string);
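
A minimal sketch of the two replacements chained on a made-up sample string. The function name `redact` and the sample text are assumptions for illustration, not part of the answer. Emails are handled first here, because the URL pattern also matches the domain part of an address and, if run first, would leave `test@[removed]` behind.

```php
<?php
// Hypothetical helper: applies the email pattern first, then the URL pattern.
function redact(string $text): string
{
    $email = '/[^@\s]*@[^@\s]*\.[^@\s]*/';
    $url   = '/[a-zA-Z]*[:\/\/]*[A-Za-z0-9\-_]+\.+[A-Za-z0-9\.\/%&=\?\-_]+/i';

    $text = preg_replace($email, '[removed]', $text);
    return preg_replace($url, '[removed]', $text);
}

echo redact('Mail me at test@mail.com or visit https://example.com/page today'), "\n";
// Mail me at [removed] or visit [removed] today
```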

Resources

PHP manual entry: http://php.net/manual/en/function.preg-replace.php

Credit where credit is due: email regex taken from preg_match manpage, and URL regex taken from: http://www.weberdev.com/get_example-4227.html

Josiah
  • Can you post a small sample of the text? – Josiah Jul 21 '10 at 20:00
  • It was just some random text I had. Nothing specific, just some email addresses and some links – JEagle Jul 21 '10 at 20:08
  • That's not right. The regex for emails would not remove punctuation like :?#$% which is not allowed in valid email addresses. Regex must remove all characters except alphanumeric and period (.). Everything else (some other characters might also be allowed, but not all!) must be removed. – ZurabWeb Nov 08 '13 at 17:18
  • Thanks, it's working. Can you suggest a pattern to remove a 10-digit mobile number? – Deepak Jul 13 '16 at 20:46
  • Not working properly. Need to be improved. – Mutatos Mar 30 '21 at 16:47
  • @Deepak Try this code sample: preg_match_all('/[+027][0-9]{10}/', $string, $output). It matches any phone number that starts with 0, 2, 7 or +; pass the same pattern to preg_replace to remove them. – stanley mbote Oct 06 '21 at 13:42
  • Why bloat your pattern with `A-Za-z` if you are going to use the `i` modifier? Do you know what `\w` means to the regex engine? – mickmackusa May 03 '23 at 13:32
6

Try this:

$patterns = array('<[\w.]+@[\w.]+>', '<\w{3,6}:(?:(?://)|(?:\\\\))[^\s]+>');
$replacements = array('[email removed]', '[link removed]');
$newString = preg_replace($patterns, $replacements, $stringToBeMatched);

Note: you can pass an array of patterns and an array of replacements into preg_replace instead of running it twice.
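
As a rough sketch of how the array form behaves: preg_replace applies the patterns in order, each one to the result of the previous, and pairs each pattern with the replacement at the same position. The sample sentence is an assumption for illustration.

```php
<?php
// Patterns use < and > as regex delimiters, as in the answer above.
$patterns = array(
    '<[\w.]+@[\w.]+>',                      // email-like tokens
    '<\w{3,6}:(?:(?://)|(?:\\\\))[^\s]+>',  // scheme:// (or scheme:\) followed by non-space
);
$replacements = array('[email removed]', '[link removed]');

$input  = 'Contact bob@example.com or see ftp://example.com/docs for details';
$output = preg_replace($patterns, $replacements, $input);

echo $output, "\n"; // Contact [email removed] or see [link removed] for details
```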

treeface
1

My answer is a variation of Josiah's /[^@\s]*@[^@\s]*\.[^@\s]*/ for emails, which works fine but also matches any punctuation directly after the email address: demo 1

Adapt the regex to /[^@\s]*@[^@\s\.]*\.[^@\s\.,!?]*/ to exclude . , ! and ? from the match: demo 2
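
A small comparison of the two patterns, using a made-up sentence (the variable names and sample text are assumptions). It also shows one trade-off of the adapted pattern: because dots are excluded after the first domain label, a domain with more than one dot is only partly matched.

```php
<?php
$original = '/[^@\s]*@[^@\s]*\.[^@\s]*/';
$adapted  = '/[^@\s]*@[^@\s\.]*\.[^@\s\.,!?]*/';

$input = 'Write to test@mail.com, please.';

echo preg_replace($original, '[removed]', $input), "\n"; // Write to [removed] please.
echo preg_replace($adapted, '[removed]', $input), "\n";  // Write to [removed], please.

// Caveat: the adapted pattern stops at the second dot of the domain.
echo preg_replace($adapted, '[removed]', 'a@mail.example.com'), "\n"; // [removed].com
```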

Abdul Aziz Barkat
Pingui
1

The answer I was going to upvote was deleted. It linked to a Linux Journal article, "Validate an E-Mail Address with PHP, the Right Way", that points out what's wrong with almost every email regex anyone proposes.

The range of valid forms of an email address is much broader than most people think.
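
To illustrate, here is a sketch with a deliberately simple pattern of the kind proposed in this thread (the pattern variable and the sample addresses are assumptions). Characters such as `+` and `'` are allowed in the local part of an address, but the pattern does not know that, so part of each address leaks through.

```php
<?php
// A deliberately simple pattern: word characters and dots on both sides of @.
$simple = '/[\w.]+@[\w.]+/';

$samples = array('first+tag@example.com', "o'brien@example.com");
foreach ($samples as $address) {
    echo preg_replace($simple, '[removed]', $address), "\n";
}
// Prints "first+[removed]" and "o'[removed]".
```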

Stephen P
0

Many characters are valid in the local part of an email address (see What characters are allowed in an email address?), so these lines replace most valid email addresses:

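<?php
// characters allowed in the domain part; the hyphen is escaped so it is
// never read as a range when $c is interpolated into a longer character class
$c = '\w\-';
$la = preg_quote('!#$%&\'*+/=?^_`{|}~', "/"); // additional characters allowed in the local part
$email = "[$c$la][$c$la\.]*[^.]@[$c]+\.[$c]+";

// replace each address with a placeholder:
$t = preg_replace("/\b($email)\b/", '[removed]', $t);
// or, instead of the line above, turn each address into a mailto link:
// $t = preg_replace("/\b($email)\b/", '<a href="mailto:\1">\1</a>', $t);

// replace URLs
$t = preg_replace("/[htpsf]+:\/+[$c]+\.+[$c\.\/%&;+~=\?#]+/i", '[removed]', $t);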

This covers most valid email addresses. Be aware that matching all valid email addresses, and only those, is more complex (see How can I validate an email address using a regular expression?).

rubo77
  • `[htpsftp]` and `[:\/\/]` reveals a lack of basic regex knowledge. A "character class" should only contain unique characters. This answer is dispensing bad advice. This question is concerned with redacting text, not creating HTML markup. – mickmackusa May 03 '23 at 13:35
  • I removed the obsolete letters in the Regex – rubo77 May 10 '23 at 17:19
  • `a` should be declared as variable `$a`. The following characters do not need escaping slashes when written into a character class: `!#$%&*+/=?^_{|}.` and the backtick. `$a` is the same as `$c` and can be condensed to `\w-`. If you are going to use the fullstring match when replacing, then there is no reason for the capturing parentheses -- just remove them and use `\0` in the replacement. – mickmackusa May 10 '23 at 19:54
  • Thanks. I applied some of your hints and left the rest as it is for readability – rubo77 May 18 '23 at 14:13
  • `preg_quote()`ing all of those characters is unnecessary because they are used in a character class. The hyphen is redundant because of the contents of `$c`. Did you actually test your script in a sandbox? – mickmackusa May 18 '23 at 14:25
0

Pattern for email (thanks to @bromelio)

"/[^@\s]*@[^@\s\.]*\.[^@\s\.,!?]*/"

Pattern for URL

"#((?:https?|ftp)://\S+[[:alnum:]]/?)#si"
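
A short sketch of what the URL pattern does with trailing punctuation, using an assumed sample sentence: the match must end in an alphanumeric character (optionally followed by a slash), so a sentence-ending period stays in the text.

```php
<?php
$url = '#((?:https?|ftp)://\S+[[:alnum:]]/?)#si';

$input = 'See https://example.com/page. Thanks!';
echo preg_replace($url, '[removed]', $input), "\n"; // See [removed]. Thanks!
```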
Mutatos
-1

My answer is a slight improvement on Josiah's code. It combines the two code segments into one, since preg_replace() accepts the pattern either as a string or as an array.

$patterns = array();

$patterns[0] = "/[^@\s]*@[^@\s]*\.[^@\s]*/"; // removes emails

$patterns[1] = "/[a-zA-Z]*[:\/\/]*[A-Za-z0-9\-_]+\.+[A-Za-z0-9\.\/%&=\?\-_]+/i"; // removes any link

$replace = "[removed]";

$string = "Follow the link below https://stackoverlow.com/testing/preg-match-replace-in-php or email me a sample code in my email test@mail.com";

$string = preg_replace($patterns, $replace, $string);

If you want a different replacement text for emails and for links, for instance [Email has been removed] and [Link has been removed], pass an array of replacements instead of a single string:

$replacements = array();
// replacement message for emails
$replacements[0] = "[Email has been removed]";
// replacement message for links
$replacements[1] = "[Link has been removed]";

Every other part of the code remains the same, except that $replacements is passed to preg_replace() as the second argument.
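
Put together, the two variants might look like the sketch below. The sample sentence is an assumption; the patterns are the ones from this answer. With a single string, every pattern gets the same replacement; with an array, each pattern gets the replacement at the same index.

```php
<?php
$patterns = array(
    '/[^@\s]*@[^@\s]*\.[^@\s]*/',                                      // emails
    '/[a-zA-Z]*[:\/\/]*[A-Za-z0-9\-_]+\.+[A-Za-z0-9\.\/%&=\?\-_]+/i',  // links
);
$replacements = array('[Email has been removed]', '[Link has been removed]');

$input = 'Follow https://example.com/some-page or email test@mail.com';

// One string: every pattern is replaced with the same text.
$same = preg_replace($patterns, '[removed]', $input);
// One array: each pattern uses the replacement at the same index.
$each = preg_replace($patterns, $replacements, $input);

echo $same, "\n"; // Follow [removed] or email [removed]
echo $each, "\n"; // Follow [Link has been removed] or email [Email has been removed]
```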

stanley mbote
  • `[:\/\/]` demonstrates a fundamental misunderstanding of how to use a "character class" -- it should never contain duplicated characters. The `i` pattern modifier is useless if you use `A-Za-z` everywhere. You are using excessive escaping in your character class as well. This answer is giving too much bad advice. – mickmackusa May 03 '23 at 13:41