I'm currently reading the book "Automate the Boring Stuff with Python" but got stuck on one line of code in the Chapter 7 project. I just cannot understand the author's logic here.

The problem can be found at the end. Project: Phone Number and Email Address Extractor. https://automatetheboringstuff.com/chapter7

The project outline is:

Your phone and email address extractor will need to do the following:

-Gets the text off the clipboard.

-Finds all phone numbers and email addresses in the text.

-Pastes them onto the clipboard.

Here's the code:

import re, pyperclip

#extracts phone number
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?               # area code  -> either 561 or  (561)
    (\s|-|\.)?                       # separator  (if there is)
    (\d{3})                          # first 3 digits
    (\s|-|\.)                        # separator
    (\d{4})                          # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # extension
    )''', re.VERBOSE)

#extracts email
emailRegex= re.compile(r'''(
    [a-zA-Z0-9._%+-]+                # username
    @                                # @symbol
    [a-zA-Z0-9._%+-]+                # domain name
    (\.[a-zA-Z]{2,4})                # dot something
    )''',re.VERBOSE)

# find matches in clipboard text.
text = str(pyperclip.paste())               # read the clipboard contents into the 'text' string
matches = []
for groups in phoneRegex.findall(text):
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])   # groups[1] -> area code, groups[3] -> first 3 digits, groups[5] -> last 4 digits
    if groups[8] != '':                     # groups[8] is the extension digits, '' when there is none
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)

for groups in emailRegex.findall(text):
    matches.append(groups[0])

#Copy results to the clipboard. (our new string)

if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')

Where I'm stuck is this segment:

for groups in phoneRegex.findall(text):            
        phoneNum= '-'.join([groups[1],groups[3],groups[5]])   #area code, first 3 digits, last 4 digits of phone number 
        if groups[8] != '':
            phoneNum += ' x' + groups[8]
        matches.append(phoneNum)

The author explains that these are the area code, first 3 digits, and last 4 digits that were extracted from the phone number:

groups[1],groups[3],groups[5]

But this doesn't make sense to me. Notice that this for loop iterates through each element; 'groups' is not the whole list, it's just one element of the list. So groups[1] would be the second digit of the first element, not the actual element.

Just to illustrate my problem better, here's another example:

num= re.compile(r'(\d+)')
for groups in num.findall('Extract all 23 numbers 444 from 2414 at, 1'):
     print(groups)

output:

23
444
2414
1

for groups in num.findall('Extract all 23 numbers 444 from 2414 at, 1'):
    print(groups[0])

output:

2
4
2
1

So groups[0] is not the element, just a digit of the element.
Hopefully this makes sense, because I'm having a lot of trouble understanding his reasoning. Any help would be appreciated.

UPDATE: It seems that groups[0] is the first element of the tuple:

num= re.compile(r'(\d+)\D+(\d+)\D+(\d+)')
for groups in num.findall('Extract all 23 numbers 444 from 2414 at, 10,434,555'):
    print(groups[0])

output:

23
10
  • Run your experiment with more than one group in the regex to see the difference, then read the [`re.findall` documentation](https://docs.python.org/2/library/re.html#re.findall). – user2357112 Oct 03 '16 at 04:53
  • Both the regex for phone numbers and the one for email are completely bogus and will fail to recognise valid email addresses and most phone numbers. E.g. the local part of an email can have [many more characters](http://stackoverflow.com/a/2049510/1307905). About 5% of the phone numbers in my address book would match the phone regex. If the book doesn't explicitly mention that deficiency, I would not trust it for other things. – Anthon Oct 03 '16 at 05:05
  • I was scratching my head for hours. I just needed a bit of guidance, and I now see the logic. @user2357112 Thanks a lot. – tadm123 Oct 03 '16 at 05:17
  • @Anthon> the book he's following is meant as an introduction to python, not as a comprehensive [grammar of email addresses](http://tools.ietf.org/html/rfc5322#section-3.4). – spectras Oct 03 '16 at 05:47
  • @spectras It doesn't have to be complete, it should just explicitly and clearly state that it isn't. If it doesn't, the author will breed another generation of programmers who at some point have to unlearn that every phone number is 10 digits long after some customer complains. – Anthon Oct 03 '16 at 06:17
  • Sorry, I have just one last question. Wouldn't group[0] be the area code in the code, not group[1]? It seems like the indexes for each group should be one less than they are. – tadm123 Oct 03 '16 at 21:28
  • I updated the question and included another example. – tadm123 Oct 03 '16 at 21:49
  • Try it with the actual regex they used, and you'll find that `groups[1]` is the area code. (You missed a group.) – user2357112 Oct 03 '16 at 22:22
  • In the initial code, `groups` is the tuple of captured groups for one found phone number. `groups[0]` would be the entire matched number (see the first parenthesis). `groups[1]` is the area code. – OneCricketeer Oct 03 '16 at 22:23
  • Yes, it seems there are 2 parentheses at the beginning. I overlooked that small detail. Thanks again for pointing it out, guys. – tadm123 Oct 03 '16 at 22:32

1 Answer

When the regex contains more than one group, findall() returns a list of tuples, one tuple per match, and the for loop hands you those tuples one at a time:

for groups in phoneRegex.findall(text):            
    phoneNum= '-'.join([groups[1],groups[3],groups[5]]) 
    print(groups) #you can add one more line to check it out

the result is:

('800.420.7240', '800', '.', '420', '.', '7240', '', '', '') # one of the tuples in the list returned by findall()
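
To see why the one-group experiment in the question printed single digits, here is a minimal sketch of how the shape of the findall() result changes with the number of capturing groups. The sample string and the three simplified patterns are made up for illustration and are not the book's regex.

```python
import re

# Hypothetical sample text, not taken from the book.
sample = 'Call 800.420.7240 or 415-555-4242'

# No capturing groups: a list of whole-match strings.
no_groups = re.findall(r'\d{3}[.-]\d{3}[.-]\d{4}', sample)

# Exactly one capturing group: a list of strings holding only
# what that single group captured.
one_group = re.findall(r'(\d{3})[.-]\d{3}[.-]\d{4}', sample)

# Two or more capturing groups: a list of tuples, one per match.
many_groups = re.findall(r'(\d{3})[.-](\d{3})[.-](\d{4})', sample)

print(no_groups)
print(one_group)
print(many_groups)
```

With a single group each loop item is a string, so indexing it with [0] gives one character. With several groups each loop item is a tuple, so [0] gives the text captured by the first group.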

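The first item of each tuple, groups[0], is the entire matched number, because the regex wraps everything in an outer pair of parentheses, which makes the whole pattern capture group 1. That is why groups[1] is the area code. On a match object, group(0) always returns the entire match: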

>>> phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
>>> mo = phoneNumRegex.search('My number is 415-555-4242.')
>>> mo.group(0)
'415-555-4242'
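
The two numbering schemes are easy to mix up, so here is a rough sketch comparing them on the phone regex from the question (with the dot in `ext\.` escaped). The sample text and the names `row` and `m` are assumptions for illustration. Index i of a findall() tuple corresponds to group(i + 1) of a match object, since group(0) is reserved for the whole match.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?               # group 2: area code
    (\s|-|\.)?                       # group 3: separator
    (\d{3})                          # group 4: first 3 digits
    (\s|-|\.)                        # group 5: separator
    (\d{4})                          # group 6: last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # groups 7 to 9: extension
    )''', re.VERBOSE)                # the outer parentheses are group 1

text = 'Call 800.420.7240 today'     # hypothetical sample text

row = phone_regex.findall(text)[0]   # tuple of groups 1..9 for the first match
m = phone_regex.search(text)         # match object for the same match

print(row)
print(m.group(0))                    # entire match
print(row[0] == m.group(1))          # tuple index 0 is group 1
print(row[1] == m.group(2))          # tuple index 1 is group 2, the area code
```

Groups that did not take part in the match, such as the extension here, show up as empty strings in the findall() tuple, which is why the book's code compares groups[8] against ''.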