I will be gathering scattered emails from a larger CSV file. I am just now learning regex. I am trying to extract the emails from this example sentence. However, emails is populating with only the @ symbol and the letter immediately before that. Can you help me see what's going wrong?

import re

String = "'Jessica's email is jessica@gmail.com, and Daniel's email is daniel123@gmail.com. Edward's is edwardfountain@gmail.com, and his grandfather, Oscar's, is odawg@gmail.com.'"

emails = re.findall(r'.[@]', String)
names = re.findall(r'[A-Z][a-z]*',String)

print(emails)
print(names)
cs95
EwokHugz

4 Answers

Your e-mail regex cannot work as written: in emails = re.findall(r'.[@]', String), the pattern matches exactly one arbitrary character followed by @, and nothing else.
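To see the difference a quantifier makes, here is a minimal sketch on a shortened version of the sentence. The variable names and the \w+ choice are illustrative assumptions, not a complete e-mail pattern.

```python
import re

text = "Jessica's email is jessica@gmail.com, and Daniel's email is daniel123@gmail.com."

# '.' matches exactly one character and '[@]' a literal '@',
# so every match is only two characters long.
one_char = re.findall(r'.[@]', text)
print(one_char)      # ['a@', '3@']

# '+' lets the preceding token repeat, so the whole run of word
# characters before the '@' is consumed.
local_parts = re.findall(r'\w+@', text)
print(local_parts)   # ['jessica@', 'daniel123@']
```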

I would try a different approach: match the sentences and extract (name, e-mail) pairs, based on the following empirical assumptions (if your text changes too much, the logic will break):

  • every name is followed by 's, and " is " appears somewhere after it (the non-greedy .*? matches whatever lies in between)
  • \w matches any alphanumeric character (or underscore), and the domain contains only one dot (otherwise the pattern would also match the final dot of the sentence)

code:

import re

String = "'Jessica's email is jessica@gmail.com, and Daniel's email is daniel123@gmail.com. Edward's is edwardfountain@gmail.com, and his grandfather, Oscar's, is odawg@gmail.com.'"

# raw string, so \w and \. reach the regex engine without escape warnings
print(re.findall(r"(\w+)'s.*? is (\w+@\w+\.\w+)", String))

result:

[('Jessica', 'jessica@gmail.com'), ('Daniel', 'daniel123@gmail.com'), ('Edward', 'edwardfountain@gmail.com'), ('Oscar', 'odawg@gmail.com')]

Converting that list with dict() would even give you a name => address dictionary:

{'Oscar': 'odawg@gmail.com', 'Jessica': 'jessica@gmail.com', 'Daniel': 'daniel123@gmail.com', 'Edward': 'edwardfountain@gmail.com'}
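A minimal sketch of that conversion, on a shortened sentence. The names pairs and name_to_email are illustrative. Note that dict() keeps only the last address if the same name appears twice.

```python
import re

text = "Jessica's email is jessica@gmail.com, and Daniel's email is daniel123@gmail.com."

# findall with two groups returns a list of (name, email) tuples,
# which dict() accepts directly as key/value pairs.
pairs = re.findall(r"(\w+)'s.*? is (\w+@\w+\.\w+)", text)
name_to_email = dict(pairs)
print(name_to_email)
```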

The general case needs more chars (not sure I'm exhaustive):

String = "'Jessica's email is jessica_123@gmail.com, and Daniel's email is daniel-123@gmail.com. Edward's is edward.fountain@gmail.com, and his grandfather, Oscar's, is odawg@gmail.com.'"

# raw string again; the character classes also allow '-' and '.' in addresses
print(re.findall(r"(\w+)'s.*? is ([\w\-.]+@[\w\-.]+\.[\w\-]+)", String))

result:

[('Jessica', 'jessica_123@gmail.com'), ('Daniel', 'daniel-123@gmail.com'), ('Edward', 'edward.fountain@gmail.com'), ('Oscar', 'odawg@gmail.com')]
Jean-François Fabre

1. Emails

    In [1382]: re.findall(r'\S+@\w+\.\w+', text)
    Out[1382]: 
    ['jessica@gmail.com',
     'daniel123@gmail.com',
     'edwardfountain@gmail.com',
     'odawg@gmail.com']

How it works: every e-mail here has the shape xxx@xxx.xxx, that is, a run of characters on each side of the @ and a single . in the domain. \S matches any character that is not whitespace, and + means one or more of them. \w+\.\w+ then matches the domain: word characters, one literal dot, and more word characters, which is why the period that ends the sentence is left out.
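One caveat, sketched below under assumed inputs: because \S accepts any non-whitespace character, punctuation glued to the front of an address is swallowed too. The sample text and the stricter character class are illustrative assumptions, not a full e-mail grammar.

```python
import re

sample = "Contact (bob@example.com) or mailto:amy@example.org"

# \S+ happily takes '(' and 'mailto:' along with the local part.
loose = re.findall(r'\S+@\w+\.\w+', sample)
print(loose)    # ['(bob@example.com', 'mailto:amy@example.org']

# Restricting the local part to word characters, '.', '+' and '-'
# leaves the surrounding punctuation out.
strict = re.findall(r'[\w.+-]+@\w+\.\w+', sample)
print(strict)   # ['bob@example.com', 'amy@example.org']
```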


2. Names

    In [1375]: re.findall(r'[A-Z]\S+(?=\')', text)
    Out[1375]: ['Jessica', 'Daniel', 'Edward', 'Oscar']

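How it works: [A-Z] starts the match at an uppercase letter, and \S+ takes the non-whitespace characters that follow. The (?=\') is a lookahead. As you can see, all names follow the pattern Name's, and we want everything before the apostrophe. The lookahead checks that an apostrophe comes next without including it in the match.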
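A small comparison, on a shortened sentence, of the same pattern with and without the lookahead. The variable names are illustrative.

```python
import re

text = "Jessica's email is jessica@gmail.com, and Daniel's email is daniel123@gmail.com."

# With the lookahead, the apostrophe must follow but is not part of the match.
with_lookahead = re.findall(r"[A-Z]\S+(?=')", text)
print(with_lookahead)      # ['Jessica', 'Daniel']

# Without it, the apostrophe is consumed and shows up in the result.
without_lookahead = re.findall(r"[A-Z]\S+'", text)
print(without_lookahead)   # ["Jessica'", "Daniel'"]
```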


Now, if you want to map names to emails by capturing them together with one massive regex, you can. Jean-François Fabre's answer is a good start. But I recommend getting the basics down pat first.

cs95
  • `odawg@gmail.com.`: last e-mail has a dot in the end, others get the commas as well :) – Jean-François Fabre Aug 04 '17 at 05:07
  • @Jean-FrançoisFabre [Last 24 hours](https://puu.sh/x1aUC/32ef6e3aea.png). Yeah it's been a tough day -- tough crowd to please. – cs95 Aug 04 '17 at 05:24
  • @Jean-FrançoisFabre Honestly I start exercising my hammer when I need to blow off some steam -- which has been happening quite often actually. – cs95 Aug 04 '17 at 05:29
  • 470 and you're complaining? I admit a lot of users don't accept answers (or accept the lamest/least upvoted answer there is because they understand it) – Jean-François Fabre Aug 04 '17 at 05:55
  • I just realised that `[\w_]` is redundant since `\w` contains `_` ! – Jean-François Fabre Mar 16 '18 at 15:53
  • @Jean-FrançoisFabre random much? ;) but you're right, I guess I can edit when I'm at a PC. – cs95 Mar 16 '18 at 15:59
  • doesn't hurt. Now I realize that we should have closed as a duplicate of "how to extract emails from string", but I felt I had to explain OP _why_ his attempt doesn't work. – Jean-François Fabre Mar 16 '18 at 16:09
  • @Jean-FrançoisFabre For some reason, I did close it... and I reopened it again... not sure why though :D – cs95 Mar 16 '18 at 17:01

You need to find anchors, patterns to match. An improved pattern could be:

import re

String = "'Jessica's email is jessica@gmail.com, and Daniel's email is daniel123@gmail.com. Edward's is edwardfountain@gmail.com, and his grandfather, Oscar's, is odawg@gmail.com.'"

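# The domain is one or more dot-separated labels, so the period that ends
# a sentence right after an address is not swallowed into the match.
emails = re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+', String)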
names = re.findall(r'[A-Z][a-z]*', String)

print(emails) 
print(names)

\w+ does not match '-', which is allowed in email addresses.
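A quick check of that point, using a made-up hyphenated address. The sample string and the shortened character classes are assumptions for illustration only.

```python
import re

sample = "Write to mary-jane@my-site.org for details"

# \w excludes '-', so the domain 'my-site' cannot be matched at all.
word_only = re.findall(r'\w+@\w+\.\w+', sample)
print(word_only)     # []

# Adding '-' (and '.', '+') to the classes picks up the full address.
with_hyphen = re.findall(r'[\w.+-]+@[\w-]+\.[\w-]+', sample)
print(with_hyphen)   # ['mary-jane@my-site.org']
```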

furoscame

This is because you are not using a repetition operator. The code below uses the + operator, which means the character or sub-pattern just before it can repeat one or more times.

import re

s = '''Jessica's email is jessica@gmail.com, and Daniel's email is daniel123@gmail.com. Edward's is edwardfountain@gmail.com, and his grandfather, Oscar's, is odawg@gmail.com.'''

p = r'[a-z0-9]+@[a-z]+\.[a-z]+'
ans = re.findall(p, s)

print(ans)
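Since the question mentions a larger CSV file, here is a rough sketch of applying the same pattern to every cell. The CSV content, column names, and the EMAIL and found names are hypothetical stand-ins. With a real file you would pass an opened file object to csv.reader instead of io.StringIO.

```python
import csv
import io
import re

# Hypothetical CSV content standing in for the real file.
csv_text = (
    'id,note\n'
    '1,"Reach Jessica at jessica@gmail.com, thanks"\n'
    '2,no address here\n'
    '3,"daniel123@gmail.com and odawg@gmail.com."\n'
)

# Same simple pattern as above, compiled once for reuse.
EMAIL = re.compile(r'[a-z0-9]+@[a-z]+\.[a-z]+')

found = []
for row in csv.reader(io.StringIO(csv_text)):
    for cell in row:
        # findall returns every address in the cell, possibly none.
        found.extend(EMAIL.findall(cell))

print(found)
```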
Anonta