On one of my PHP sites, I use this regular expression to automatically remove phone numbers from strings:

$text = preg_replace('/\+?[0-9][0-9()-\s+]{4,20}[0-9]/', '[removed]', $text);

However, when users post long URLs that contain several numbers, the URL is also affected by the preg_replace, which breaks the URL.

How can I ensure the above preg_replace does not alter URLs contained in $text?

EDIT:

As requested, here is an example of a URL being broken by the preg_replace above:

$text = 'Please help me with my question here: https://stackoverflow.com/questions/20589314/  Thanks!';
$text = preg_replace('/\+?[0-9][0-9()-\s+]{4,20}[0-9]/', '[removed]', $text);
echo $text; 

//echoes: Please help me with my question here: https://stackoverflow.com/questions/[removed]/ Thanks!
ProgrammerGirl

3 Answers

2

  • I think you have to parse the URL AND the phone number, like `/(?: url \K | phone number)/` – sln
  • @sln: How would I do that? If it helps, there is a URL regex here: stackoverflow.com/a/8234912/869849 – ProgrammerGirl

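Here is an example that combines the provided URL regex with the phone number regex. The URL alternative is tried first and captured in group 1, so the callback can return any URL unchanged and replace only phone numbers:

PHP test case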

 $text = 'Please help me with my +44-83848-1234 question here: http://stackoverflow.com/+44-83848-1234questions/20589314/ phone #:+44-83848-1234-Thanks!';
 $str = preg_replace_callback('~((?:(?:[a-zA-Z]{3,9}:(?://)?)(?:[;:&=+$,\w-]+@)?[a-zA-Z0-9.-]+|(?:www\.|[;:&=+$,\w-]+@)[a-zA-Z0-9.-]+)(?:(?:/[+\~%/.\w-]*)?\??[+=&;%@.\w-]*\#?\w*)?)|(\+?[0-9][0-9()\s+-]{4,20}[0-9])~',
                   function( $matches ){
                        if ( $matches[1] != "" ) {
                             return $matches[1];
                        }
                        return '[removed]';
                   },
                   $text);

 print $str;

Output >>

 Please help me with my [removed] question here: http://stackoverflow.com/+44-83848-1234questions/20589314/ phone #:[removed]-Thanks!

Regex, processed with RegexFormat

 # '~((?:(?:[a-zA-Z]{3,9}:(?://)?)(?:[;:&=+$,\w-]+@)?[a-zA-Z0-9.-]+|(?:www\.|[;:&=+$,\w-]+@)[a-zA-Z0-9.-]+)(?:(?:/[+\~%/.\w-]*)?\??[+=&;%@.\w-]*\#?\w*)?)|(\+?[0-9][0-9()\s+-]{4,20}[0-9])~'

     (                                  # (1 start), URL
          (?:
               (?:
                    [a-zA-Z]{3,9} :
                    (?: // )?
               )
               (?: [;:&=+$,\w-]+ @ )?
               [a-zA-Z0-9.-]+ 
            |  
               (?: www \. | [;:&=+$,\w-]+ @ )
               [a-zA-Z0-9.-]+ 
          )
          (?:
               (?: / [+~%/.\w-]* )?
               \??
               [+=&;%@.\w-]* 
               \#?
               \w* 
          )?
     )                                  # (1 end)
  |  
     (                                  # (2 start), Phone Num
          \+? 
          [0-9] 
          [0-9()\s+-]{4,20} 
          [0-9] 
     )                                  # (2 end)
  • Very interesting, thank you! Is there a way to do this using just 1 line of `preg_replace`? – ProgrammerGirl Dec 17 '13 at 10:31
  • Instead of 1 line of `preg_replace_callback`? Depends on what the replacement is. As I said earlier, preg_replace `/(?: url \K | phone number)/` with "". –  Dec 17 '13 at 16:05
  • I tried what you had mentioned in your comment, and it correctly ignores URLs; however, it then appends "[removed]" to the end of the URLs. Do you know how to fix that? – ProgrammerGirl Dec 18 '13 at 21:23
  • There is the dilemma. If you replace with the empty string, it could be done with a simple `preg_replace`. The URL must be consumed independently to pass by it because the phone number is a subset of it. There is no practical way to use assertions in this case. Within regex engines, a callback is a simple extra function call, really an imperceptible amount of overhead. If you want to get the job done, I suggest using this method. –  Dec 19 '13 at 21:36
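For the one-line `preg_replace` asked about in these comments, one option that the answer does not use is PCRE's backtracking control verbs `(*SKIP)(*F)`. The sketch below is not the answerer's method. It assumes a deliberately simple URL pattern (`https?://\S+`), and the variable names are only illustrative.

```php
<?php
// Assumption: a deliberately simple URL pattern; swap in a stricter one if needed.
$url   = 'https?://\S+';
$phone = '\+?[0-9][0-9()\s+-]{4,20}[0-9]';

// A URL is matched first and then rejected by (*SKIP)(*F): the engine gives up
// on that match and resumes scanning after the URL, so digits inside it are
// never offered to the phone alternative.
$pattern = "~{$url}(*SKIP)(*F)|{$phone}~";

$text  = 'Call +44-83848-1234 or see https://stackoverflow.com/questions/20589314/ Thanks!';
$clean = preg_replace($pattern, '[removed]', $text);

echo $clean, PHP_EOL;
```

With this input the phone number is replaced and the URL is left as it was. Because the URL alternative always fails, the replacement string is only ever applied to phone numbers, which avoids the "[removed]" being appended after each URL.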
1

This takes a bit more code: match URLs and numbers together, then remove only the numbers that do not also appear inside a URL.

<?php
    $text = "This is my number20558789yes with no spaces
    and this is yours 254785961
    But this 20558474 is within http://stackoverflow.com/questions/20558474/
    So I don't remove it
    and this is another url http://stackoverflow.com/questions/20589314/ 
    Thanks!";
    $up = "(https?://[-.a-zA-Z0-9]+\.[a-zA-Z]{2,3}/\S*)"; // group 1: URLs
    $np = "(\+?[0-9][0-9()\s+-]{4,20}[0-9])"; // group 2: phone numbers (hyphen moved to the end of the class so it is read as a literal)
    preg_match_all("#{$up}|{$np}#", $text, $matches); // $matches[1] holds the URLs, $matches[2] holds the numbers
    preg_match_all("#{$np}#", implode("\n", array_filter($matches[1])), $urls_numbers); // numbers that occur inside the matched URLs, if any
    $diff = array_diff(array_filter($matches[2]), $urls_numbers[0]); // numbers that do not occur inside any URL
    $text = str_replace($diff, "[removed]", $text); // replace only those numbers
    echo $text;

Output:

This is my number[removed]yes with no spaces
and this is yours [removed]
But this 20558474 is within http://stackoverflow.com/questions/20558474/
So I don't remove it
and this is another url http://stackoverflow.com/questions/20589314/ 
Thanks!
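
One consequence of the `array_diff` step above is that a number which appears both inside a URL and on its own is kept in both places, as the `20558474` line in the output shows. If that is not wanted, a possible alternative is to split the text on URLs and run the phone regex only on the pieces between them. This is a sketch under assumptions: the URL pattern is a simple `https?://\S+`, and the function name is hypothetical.

```php
<?php
// Hypothetical helper; the URL pattern is a simplifying assumption.
function removePhoneNumbersOutsideUrls(string $text): string
{
    $url   = '~(https?://\S+)~';
    $phone = '/\+?[0-9][0-9()\s+-]{4,20}[0-9]/';

    // PREG_SPLIT_DELIM_CAPTURE keeps the captured URLs in the result, so even
    // indexes hold plain text and odd indexes hold URLs.
    $parts = preg_split($url, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Only the text between URLs is filtered.
            $parts[$i] = preg_replace($phone, '[removed]', $part);
        }
    }

    return implode('', $parts);
}

$text = 'Call 20558474 or see http://stackoverflow.com/questions/20558474/ Thanks!';
echo removePhoneNumbersOutsideUrls($text), PHP_EOL;
```

Here the standalone `20558474` is replaced while the same digits inside the URL are left untouched.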
revo
0

Would it be fair to assume that phone numbers are often preceded either by whitespace or are at the start of a line? If so, this would stop you from changing URLs accidentally, since neither whitespace nor newlines ever exist in the middle of URLs:

$text = preg_replace('/(^|\s)\+?[0-9][0-9()\s+-]{4,20}[0-9]/', '$1[removed]', $text);
szxk
  • The problem with your solution is that it can easily (and accidentally!) be circumvented by simply preceding a phone number with a letter. Ideally, I'm looking for a solution that will only ignore the regex if the sequence of numbers occurs inside a URL, but I have no idea how to do that. – ProgrammerGirl Dec 15 '13 at 00:06
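
Both the whitespace-anchored idea and the limitation raised in this comment can be checked with a small sketch. It uses a variant of the pattern, which is an assumption and not the answer's exact code: the lookbehind `(?<!\S)` means "at the start of the string or preceded by whitespace" and consumes nothing, so the leading whitespace is not swallowed by the replacement.

```php
<?php
// Variant pattern (assumption): (?<!\S) replaces the (^|\s) group.
$pattern = '/(?<!\S)\+?[0-9][0-9()\s+-]{4,20}[0-9]/';

$samples = [
    'Call 555-123-4567 now',                                      // preceded by a space
    'See https://stackoverflow.com/questions/20589314/ Thanks!',  // digits inside a URL
    'Call x555-123-4567 now',                                     // preceded by a letter
];

$results = [];
foreach ($samples as $sample) {
    $results[] = preg_replace($pattern, '[removed]', $sample);
}

print_r($results);
```

Only the first sample is changed. The URL survives, but so does the number prefixed with a letter, which is exactly the circumvention the comment describes.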