
I'm looking at a Python script that goes through a database and finds the valid phone numbers. The relevant part of the script looks like this:

    country_code = ""

(Then further down):

    for i, entry in enumerate(feed.entry):
        for phone in entry.phone_number:
            # Strip out any non numeric characters
            phone.text = re.sub('\D', '', phone.text)
            phone.text = unicode(phone.text, 'utf8')

            # Remove leading digit if it exists, we will add this again later for all numbers
            # Only if a country code is defined.
            if country_code != "":
                phone.text = re.sub('^\+?%s' % country_code, '', phone.text)

My first question: I've seen two variations of this script. One uses the last line shown above, while the other uses this:

                phone.text = re.compile('^\+?%s' % country_code, '', phone.text)

I am wondering what the difference is, and whether one is more correct than the other.

But the real issue is that this database is only supposed to contain North American numbers, stored as 10 or 11 digits, and what I need is ten-digit numbers only. Unfortunately, this script returns any phone number it finds. So after that last line runs, this is what I'd like to happen:

If the phone.text string is less than ten characters or more than 11 characters in length, set it to null.

Then

If the phone.text string is 11 characters in length then if first character is "1", strip the leading "1" leaving the final 10 characters. If the first character is NOT "1" then set it to null.

Then

If the phone.text string is 10 characters (in other words, not null, since it should be either 10 characters or null at this point), then check that the first digit and the fourth digit are in the range 1-9. If either is not in that range, set the string to null.
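The three rules above can be sketched without any regular expressions. This is a minimal sketch, not part of the original script: the function name `normalize_nanp` is hypothetical, it assumes the input already contains only digits (the script strips non-digits earlier), and it uses an empty string to stand in for "null", as the script's own code does.

```python
def normalize_nanp(digits):
    """Return a 10-digit number, or "" if the input fails the rules.

    Hypothetical helper; assumes `digits` contains digit characters only.
    """
    # Rule 1: reject anything shorter than 10 or longer than 11 digits.
    if len(digits) < 10 or len(digits) > 11:
        return ""

    # Rule 2: an 11-digit number must start with "1", which is dropped.
    if len(digits) == 11:
        if digits[0] != "1":
            return ""
        digits = digits[1:]

    # Rule 3: the first and fourth digits must be in the range 1-9.
    if digits[0] not in "123456789" or digits[3] not in "123456789":
        return ""

    return digits
```

Inside the loop, this would presumably be called as `phone.text = normalize_nanp(phone.text)` once the non-digit characters have been stripped.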

I know absolutely zilch about Python, so I was hoping someone could show me how to do this, since it seems like something that should be relatively easy. By the way, the documentation (such as it is) says that if you put a "1" in the country code field, it will add it to numbers that don't have it now. That doesn't seem right to me (I can't see how that would happen), and it's also the exact opposite of what I want in this particular case. Thanks!

EDIT: I did not want to use a REGEX because to be honest, this is not a time-sensitive application (I can afford to waste CPU cycles) and I really have trouble deciphering Regular Expressions - they only barely make sense to me.

I did find that if, after the line

            phone.text = unicode(phone.text, 'utf8')

I added these lines:

            if len(phone.text) == 11 and phone.text[0] == '1':
                #international code
                phone.text = phone.text[1:]
            if len(phone.text) < 10:
                phone.text = ""
            if len(phone.text) > 10:
                phone.text = phone.text[:10]

Then it would pretty much do what I needed. As I noted, the database does not contain any international numbers, so that consideration is absent. It does contain a few numbers followed by extensions, but I only needed the 10-digit primary numbers, not the extensions, so anything after the first ten digits is chopped off. I actually worked out the above after looking at the suggested REGEX-based solutions, where someone pointed out that a regex isn't always the best way to do things, a sentiment with which I wholeheartedly agree. Besides, while there were a lot of REGEXes shown for parsing phone numbers in various ways, most of those answers just assumed you know how to use a REGEX in Python, and I don't, given my total lack of familiarity with the language.

Thanks for the suggestions.

Skyviewer
  • The list of related questions beside your question provides any number of solutions. Do check whether any of them has solved your problem. – Bhargav Rao Mar 27 '15 at 18:12
  • possible duplicate of [A comprehensive regex for phone number validation](http://stackoverflow.com/questions/123559/a-comprehensive-regex-for-phone-number-validation) – Tui Popenoe Mar 27 '15 at 18:44

1 Answer


To answer your first question:

`re.compile` returns a regex object. If you're going to use the same regex pattern in multiple spots, it's better to create a reusable regex object with `re.compile`.

`re.sub(pattern, repl, string, count=0, flags=0)` is just shorthand for:

    re.compile(pattern, flags).sub(repl, string, count)
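As a quick check of that equivalence, here is a minimal sketch using the pattern from the question. The values of `country_code` and `text` are made-up sample inputs, and `leading_code` is a hypothetical variable name; raw strings are used only to keep the backslash literal.

```python
import re

country_code = "1"      # sample value; the script's default is ""
text = "12125551234"    # hypothetical sample input, digits only

# One-off call: the pattern string is passed along with the other arguments.
one_off = re.sub(r'^\+?%s' % country_code, '', text)

# Compiled once: the pattern is stored in the regex object, so later
# calls to .sub() only need the replacement and the input string.
leading_code = re.compile(r'^\+?%s' % country_code)
reused = leading_code.sub('', text)

# Both forms strip the leading country code and give the same result.
assert one_off == reused == "2125551234"
```

For a pattern used in a single place, the one-off `re.sub` form is the simpler of the two.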
OozeMeister
  • Thank you. So I'm assuming that if this is only used once in the code, re.sub is correct? Also, you seem to imply the arguments are formatted differently for each of those, but in the examples I gave above the arguments were the same. That said, it's mostly an academic question, because I figured out that that part of the code never gets executed anyway in my application. But I do appreciate the clarification! – Skyviewer Mar 29 '15 at 00:01
  • That is correct. The regex object that is returned from `re.compile` doesn't need the pattern string passed in anymore because it gets stored in the regex object and can be accessed via `my_regex.pattern`. The second call to `re.compile` is completely wrong and would fail if it were ever called (I'm assuming the original author meant `re.sub`). – OozeMeister Mar 29 '15 at 16:15
  • I had a feeling that other one was wrong. Thanks again for the info! – Skyviewer Mar 30 '15 at 22:36
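
The point made in the comments, that the second variant fails if it is ever reached, can be checked with a short sketch. The sample values and the names `result` and `error` are assumptions for illustration; `re.compile` accepts only a pattern and optional flags, so passing a replacement and an input string as well raises a `TypeError`.

```python
import re

country_code = "1"      # sample value; the script's default is ""
text = "12125551234"    # hypothetical sample input, digits only

try:
    # Same three arguments as the second variant in the question.
    result = re.compile(r'^\+?%s' % country_code, '', text)
    error = None
except TypeError as exc:
    # re.compile(pattern, flags=0) takes at most two arguments.
    result = None
    error = exc

# A correctly compiled pattern keeps its source text in .pattern.
my_regex = re.compile(r'^\+?%s' % country_code)
pattern_text = my_regex.pattern
```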