
I need to hide email addresses and phone numbers in a string. Replacing well-formatted emails and numbers is easy with a regex, but what about other formats? Here is an example:

Input:

Email addresses like email@example.com or email AT example DOT com should be replaced. Phone numbers like 347 323 4567 or three four seven, three two three four five six seven should also be replaced.

Output:

Email addresses like (email hidden) or (email hidden) should be replaced. Phone numbers like (phone hidden) or (phone hidden) should also be replaced.

AirBnB's messaging system is really good at doing that. Apparently this is how they used to do it:

It looks for @ symbols, spellings of “this is me AT whatever DOT com” and series of numbers with at least 7 digits (telephone number) with some sensitivity to separators.

What would be the best way to do the same thing? Writing complex regexes? Using a natural language processing library?

TimPetricola
  • People spell out their numbers phonetically? On what planet? – tadman Jul 31 '14 at 20:59
  • @tadman On the planet where they absolutely want to give their phone number, but it should absolutely not happen :) For example, AirBnB is doing that in the messaging system to force people to use the platform until the reservation is done through the website. – TimPetricola Jul 31 '14 at 21:17
  • So what about "three hundred forty seven, three hundred twenty three, forr fivve sixx sevven"? – tadman Aug 05 '14 at 14:47
  • @tadman Yeah I know that there is always a way to get around. But the goal is to stop as much as possible. I know it is kind of stupid and hard but I'm just wondering how it's done. – TimPetricola Aug 05 '14 at 15:40
  • @HolgerEdwardWardlowSindbæk No, it seems that the best way is to write a lot of specific rules for each different case and improve them over time. – TimPetricola Apr 15 '15 at 06:49

1 Answer


This isn't going to be easy to do in code, and it can have unpleasant consequences for your users, and then for your customer support people.

Phone numbers can be entered in a large number of formats if you allow for international numbers.

123-456-7890 could be a phone number, or it could be part of a simple subtraction like x=123-456-7890. Imagine how irritated your user will be when they get x=(phone hidden).
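To make that false positive concrete, here is a minimal Python sketch. The pattern and the `hide_phones` helper are my own illustration of the "at least 7 digits with some separators" idea from the question, not what any real site uses.

```python
import re

# Hypothetical pattern: one digit followed by at least six more digits,
# each optionally preceded by a single space, dot, or hyphen.
PHONE_RE = re.compile(r"\d(?:[ .-]?\d){6,}")


def hide_phones(text: str) -> str:
    """Replace anything that looks like a run of 7+ digits."""
    return PHONE_RE.sub("(phone hidden)", text)


print(hide_phones("Call me at 347 323 4567"))  # Call me at (phone hidden)
print(hide_phones("x=123-456-7890"))           # x=(phone hidden)
```

The pattern has no way of telling the subtraction from the phone number, so both get hidden.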

Email addresses are an even harder problem because they can vary in all sorts of ways. You can get the specification for email addresses by reading RFC 2822, and there's always the very long regex in Perl's Mail::RFC822::Address module. Most people only try to validate an address using a pattern; locating addresses inside free text can be even uglier.

In either case, there are regex patterns that attempt to do it but they all fail when pushed hard.
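As a rough sketch of what such patterns look like for emails, assuming a deliberately loose address pattern and a spelled-out "AT ... DOT ..." pattern (both patterns and the `hide_emails` name are hypothetical, and nowhere near RFC-compliant):

```python
import re

# Plain addresses such as email@example.com (loose, not RFC-compliant).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Spelled-out addresses such as "email AT example DOT com".
SPELLED_RE = re.compile(
    r"\b[\w.+-]+\s+at\s+[\w-]+(?:\s+dot\s+[\w-]+)+\b",
    re.IGNORECASE,
)


def hide_emails(text: str) -> str:
    """Hide plain addresses first, then the spelled-out form."""
    text = EMAIL_RE.sub("(email hidden)", text)
    return SPELLED_RE.sub("(email hidden)", text)


# Works on the example from the question:
print(hide_emails("Write to email@example.com or email AT example DOT com please"))
# False positive on ordinary prose:
print(hide_emails("I arrive at five dot thirty"))
# Trivial evasion passes through untouched:
print(hide_emails("email (at) example (dot) com"))
```

The first call hides both addresses, the second mangles an innocent sentence into "I (email hidden)", and the third is not caught at all. Every rule added to close one of those gaps tends to open another.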

To me this sounds like an ill-conceived idea made by an unknowing executive, similar to the request

Write a filter that removes all dirty words.

that I once received. (Yeah, right. From all written and spoken languages on earth, or merely man's desire to use such words?) A filter like this is easy to work around, and for a lot of people, defeating it will become a challenge in its own right.

the Tin Man