I am looking to validate phone numbers entered by users.

So far I have been doing this with regex, but with all the different phone number formats from around the world, the regex has become hard to maintain.

Since I have a lot of datasets of valid phone numbers I figured it might be possible to use a machine learning algorithm.

Because I don't have any prior experience with machine learning, I tried to prototype it using scikit-learn's SVM. It didn't work.

Now I'm curious whether this is even a good use case for a machine learning algorithm. If it is, what are some resources I should look up? If not, what are some alternatives to machine learning for building an easy-to-extend phone number validation?

Joshu
    I think it's not a machine learning problem, and even if you build a model it will simply guess which country a number belongs to. I don't think you want to guess... a simple country-wise regex lookup table or a phone validation API will be far more accurate and easier – sagarr Aug 16 '17 at 06:04

3 Answers

This is a case of plain computer programming. You probably need to refactor your code into some kind of class that is responsible for validating phone numbers from different countries.
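A minimal sketch of what such a class could look like, assuming one rule per country kept in a registry so that supporting a new country means adding one entry. The class name `PhoneValidator`, its methods, and the two patterns are hypothetical and deliberately simplified; they are not complete numbering plans.

```python
import re


class PhoneValidator:
    """Registry of per-country patterns (hypothetical design)."""

    def __init__(self):
        self._patterns = {}

    def register(self, country, pattern):
        # Compile once so repeated validations stay cheap.
        self._patterns[country] = re.compile(pattern)

    def is_valid(self, country, raw):
        if country not in self._patterns:
            raise ValueError(f"no rule registered for {country!r}")
        # Drop spaces, parentheses, dots and hyphens before matching.
        normalized = re.sub(r"[\s().-]", "", raw)
        return self._patterns[country].fullmatch(normalized) is not None


validator = PhoneValidator()
# Illustrative, simplified rules (assumptions, not full national rules).
validator.register("US", r"\+1[2-9]\d{2}[2-9]\d{6}")  # NANP shape NXX-NXX-XXXX
validator.register("CH", r"\+41\d{9}")                 # +41 plus nine digits
```

Calling `validator.is_valid("US", "+1 202 555 0123")` then checks only the rule registered for that country, which keeps each pattern small enough to review on its own.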

Also, from a regex perspective, the question of updating it for international phone numbers has been asked here: "What regular expression will match valid international phone numbers?", and the best answer is to use the following regex:

\+(9[976]\d|8[987530]\d|6[987]\d|5[90]\d|42\d|3[875]\d|
2[98654321]\d|9[8543210]|8[6421]|6[6543210]|5[87654321]|
4[987654310]|3[9643210]|2[70]|7|1)\d{1,14}$
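As a rough sketch of how the pattern above might be used from Python: the three lines are joined into one string (the line breaks are only for display), and common separators are stripped first because the pattern accepts digits only. The helper name `looks_like_e164` and the separator set are assumptions made for illustration.

```python
import re

# The regex quoted above, joined into a single pattern.
E164_PATTERN = re.compile(
    r"\+(9[976]\d|8[987530]\d|6[987]\d|5[90]\d|42\d|3[875]\d|"
    r"2[98654321]\d|9[8543210]|8[6421]|6[6543210]|5[87654321]|"
    r"4[987654310]|3[9643210]|2[70]|7|1)\d{1,14}$"
)


def looks_like_e164(raw: str) -> bool:
    """Check the country-code prefix and digit count only."""
    # Remove spaces, parentheses, dots and hyphens before matching.
    digits = re.sub(r"[\s().-]", "", raw)
    return E164_PATTERN.fullmatch(digits) is not None
```

A match only says the string starts with a known country code and has a plausible number of digits; it says nothing about whether the number is actually assigned.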

Regarding machine learning, here's a nice summary of the kinds of questions machine learning can answer, which boils down to the following list:

  1. Is this A or B?
  2. Is this weird?
  3. How much/how many?
  4. How is it organized?
  5. What should I do next?

Check the blog article (there is also a video within the article) for more details. Your question doesn't really fit in any of the above five categories.

Mohamed Ali JAMAOUI

International phone number rules are immensely complicated, so it's unlikely that a single regex will work. Training a machine learning algorithm could potentially work if you have enough data, but there are some weird edge cases and formatting variations (including multiple ways of writing the same phone number) that would make life difficult.

A better option is to use Google's libphonenumber. It's an open source phone number validation library implemented in C++ and Java, with ports for quite a few other languages.
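A short sketch of what this could look like in Python, assuming the third-party `phonenumbers` package (a Python port of libphonenumber, installed with `pip install phonenumbers`). The wrapper name `to_e164` and the import guard are illustrative choices, not part of the library.

```python
try:
    import phonenumbers  # third-party port of libphonenumber
except ImportError:      # keeps this sketch loadable without the package
    phonenumbers = None


def to_e164(raw, default_region=None):
    """Return the number in E.164 form if it is valid, else None.

    default_region is an ISO country code such as "GB"; it is needed
    when the input is in national format (no leading "+").
    """
    if phonenumbers is None:
        raise RuntimeError("install the 'phonenumbers' package first")
    try:
        parsed = phonenumbers.parse(raw, default_region)
    except phonenumbers.NumberParseException:
        return None  # not parseable as a phone number at all
    if not phonenumbers.is_valid_number(parsed):
        return None  # parseable, but not valid for its region
    return phonenumbers.format_number(
        parsed, phonenumbers.PhoneNumberFormat.E164
    )
```

Because the library ships the per-country metadata, extending coverage is mostly a matter of upgrading the package rather than editing patterns by hand.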

Gorcha

The given task is syntax-restricted and subject to regulatory procedures

Machine learning would need a training dataset large enough to meet the projected error rate (as constrained by Hoeffding's inequality), and for low target error rates such a dataset is (almost) impossible to assemble.
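To give a feel for the scale involved, here is a small sketch based on the textbook form of Hoeffding's inequality for a single fixed hypothesis, P(|E_in - E_out| > eps) <= 2 * exp(-2 * eps^2 * N). The function name and the example tolerances are assumptions chosen for illustration; the answer itself does not state any numbers.

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Smallest N such that 2 * exp(-2 * epsilon**2 * N) <= delta.

    epsilon: tolerated gap between training and true error rate.
    delta:   tolerated probability that the gap is exceeded.
    """
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    # Solve the bound for N and round up to a whole number of samples.
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))
```

The required N grows with 1/epsilon^2, so tightening the tolerated error by a factor of ten multiplies the number of labelled examples by roughly a hundred, and this is before accounting for the size of the hypothesis set.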

So even the regex tools are (almost) guessing, as the terminal parts of the E.164 "address" are (almost) unmaintainable for the global address space.

Probabilistic ML learners may make some sense here, but again, these will knowingly guess (with the comfort of providing a working estimate of the confidence level achieved by each such guess).

Why?

Because each telephone number (setting aside lexical irregularities and similar cosmetic details) must conform to a global set of regulations (ITU-T governed), then, on a lower level, to a national set of regulations (multi-party governed), and finally there are two distinct E.164 "address" assignment procedures, which does not make the story any easier.


RFC 4725 (ENUM Validation Architecture), a brief view:

This is just to illustrate the [ ITU-T [, NNPA [, CSP [, <privateAdmin> ]]]] hierarchy of distributed rules (absolute syntax, distributed governance) that any analysis of E.164 number blocks, down to an individual number, has to take into account.

RFC 4725              ENUM Validation Architecture         November 2006


   These two variants of E.164 number assignment are depicted in
   Figure 2:

   +--------------------------------------------+
   | International Telecommunication Union (ITU)|
   +--------------------------------------------+
                        |
              Country codes (e.g., +44)
                        |
                        v
    +-------------------------------------------+
    | National Number Plan Administrator (NNPA) |------------+
    +-------------------------------------------+            |
                        |                                    |
                  Number Ranges                              |
            (e.g., +44 20 7946 xxxx)                         |
                        |                                    |
                        v                                    |
      +--------------------------------------+               |
      | Communication Service Provider (CSP) |               |
      +--------------------------------------+               |
                        |                                    |
                        |                              Single Numbers
              Either Single Numbers              (e.g., +44 909 8790879)
                 or Number Blocks                       (Variant 2)
     (e.g., +44 20 7946 0999, +44 20 7946 07xx)              |
                   (Variant 1)                               |
                        |                                    |
                        v                                    |
                  +----------+                               |
                  | Assignee |<------------------------------+
                  +----------+

                     Figure 2: E.164 Number Assignment

   (Note: Numbers above are "drama" numbers and are shown for
   illustrative purpose only.  Assignment polices for similar "real"
   numbers in country code +44 may differ.)

   As the Assignee (subscriber) data associated with an E.164 number is
   the primary source of number assignment information, the NAE usually
   holds the authoritative information required to confirm the
   assignment.

   A CSP that acts as NAE (indirect assignment) may therefore easily
   assert the E.164 number assignment for its subscribers.  In some
   cases, such CSPs operate database(s) containing service information
   on their subscribers' numbers. 
user3666197