I'm looking at a Python script that goes through a database and finds the valid phone numbers. The part of the script relevant to my question looks like this:
country_code = ""
(Then further down):
for i, entry in enumerate(feed.entry):
    for phone in entry.phone_number:
        # Strip out any non numeric characters
        phone.text = re.sub('\D', '', phone.text)
        phone.text = unicode(phone.text, 'utf8')
        # Remove leading digit if it exists, we will add this again later for all numbers
        # Only if a country code is defined.
        if country_code != "":
            phone.text = re.sub('^\+?%s' % country_code, '', phone.text)
My first question: I've seen two different variations of this script. One uses the last line shown above, while the other uses this:
phone.text = re.compile('^\+?%s' % country_code, '', phone.text)
I am wondering what the difference is, and whether one is more correct than the other.
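For reference, here is a minimal sketch of how the two calls behave in the standard `re` module. The `country_code` value and the sample number are made up for illustration and are not from the script. `re.sub` performs the substitution directly, whereas `re.compile` only builds a pattern object and accepts just a pattern plus optional flags, so the three-argument form in the second variant should fail with a `TypeError`.

```python
import re

country_code = "1"      # assumed value, for illustration only
text = "12125551234"    # made-up sample number, digits only

# re.sub(pattern, replacement, string) performs the substitution directly.
stripped = re.sub(r'^\+?%s' % country_code, '', text)

# re.compile(pattern) only builds a reusable pattern object;
# the substitution is then done with that object's own .sub() method.
pattern = re.compile(r'^\+?%s' % country_code)
stripped_again = pattern.sub('', text)

# Passing three arguments to re.compile, as in the second variant,
# raises a TypeError: it accepts only a pattern and optional flags.
try:
    re.compile(r'^\+?%s' % country_code, '', text)
    compile_error = None
except TypeError as exc:
    compile_error = exc
```

Both `stripped` and `stripped_again` end up holding the number without its leading country code, while `compile_error` holds the exception raised by the three-argument call.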
But the real issue is that this database is only supposed to contain North American numbers shown as 10 or 11 digits, but what I need is ten digit numbers only. Unfortunately this script returns any phone number it finds. So after that last line runs, what I'd like to see happen is this:
If the phone.text string is less than ten characters or more than 11 characters in length, set it to null.
Then
If the phone.text string is 11 characters in length then if first character is "1", strip the leading "1" leaving the final 10 characters. If the first character is NOT "1" then set it to null.
Then
If the phone.text string is 10 characters (in other words, not null, since it should be either 10 characters or null at this point) then check to see that the first digit and the fourth digit are in the range 1-9. If either is not in that range, set the string to null.
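The three rules above could be sketched without any regular expressions, using only length checks and string slicing. This is a hedged sketch, not part of the original script: the function name `normalize_north_american` is hypothetical, "null" is represented as an empty string, and the input is assumed to contain only digits already (as it would after the `re.sub('\D', ...)` step).

```python
def normalize_north_american(digits):
    """Return a 10-digit number, or "" if the input does not qualify.

    Hypothetical helper; `digits` is assumed to contain only 0-9.
    """
    # Rule 1: anything shorter than 10 or longer than 11 digits is rejected.
    if len(digits) < 10 or len(digits) > 11:
        return ""
    # Rule 2: an 11-digit number must start with "1", which is then dropped.
    if len(digits) == 11:
        if digits[0] != "1":
            return ""
        digits = digits[1:]
    # Rule 3: the first and fourth digits must both be in the range 1-9.
    if digits[0] not in "123456789" or digits[3] not in "123456789":
        return ""
    return digits
```

Inside the loop it would presumably be called as `phone.text = normalize_north_american(phone.text)` after the country code line.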
I know absolutely zilch about Python, so I was hoping someone could show me how to do this, since it seems like something that should be relatively easy. By the way, the documentation (such as it is) says that if you put a "1" in the country code field it will add it to numbers that don't already have it. That doesn't seem right to me (I can't see how that would happen), and it's also the exact opposite of what I want in this particular case. Thanks!
EDIT: I did not want to use a REGEX because to be honest, this is not a time-sensitive application (I can afford to waste CPU cycles) and I really have trouble deciphering Regular Expressions - they only barely make sense to me.
I did find that if, after the line
phone.text = unicode(phone.text, 'utf8')
I added these lines:
if len(phone.text) == 11 and phone.text[0] == '1':
    # international code
    phone.text = phone.text[1:]
if len(phone.text) < 10:
    phone.text = ""
if len(phone.text) > 10:
    phone.text = phone.text[:10]
Then it would pretty much do what I needed. As I noted, the database does not contain any international numbers, so that consideration is absent. It does contain a few numbers followed by extensions, but I only needed the 10-digit primary numbers, not the extensions, so anything after the first ten digits is chopped off. I actually worked out the above after looking at the suggested REGEX-based solutions, where someone suggested that a regex isn't always the best way to do things, a sentiment with which I wholeheartedly agree. Besides, while there were a lot of REGEXes shown to parse phone numbers in various ways, most of those answers just assumed you know how to use a REGEX in Python, and I don't, given my total lack of familiarity with the language.
Thanks for the suggestions.