
I have a file containing random text, including some email addresses. I am trying to write code that uses a regex to find only the emails with valid domains (gmail.com, outlook.com, hotmail.com).

Here is the code I've tried so far:

import re

email_re = r'[a-zA-Z0-9_.-]+[^!#$%^&*()]@[gmail|hotmail|outlook]+[.com]+'
with open('emails.txt') as f:
    read = f.read()
email_data = re.findall(email_re, read)
print(email_data)

My emails.txt file:

sentences in it  pythonprac@dummy.com
test@gmail.com
test1@hotmail.com
The post A rough draft for a 5 paragraph essay and then a final draft. appeared first on EssayBishop.
hello@gm.com

Required Output: test@gmail.com, test1@hotmail.com

Jordan P
  • Does this answer your question? [Extract email sub-strings from large document](https://stackoverflow.com/questions/17681670/extract-email-sub-strings-from-large-document) – Bipul singh kashyap Oct 12 '20 at 07:44
  • That really depends on how accurate you want to be, because email addresses can have wildly different formats (all correct according to the specifications); see https://emailregex.com/ – ChatterOne Oct 12 '20 at 07:45
  • Maybe read the [Stack Overflow `regex` tag info page](/tags/regex/info) before posting questions; it covers several common beginner mistakes. – tripleee Oct 12 '20 at 07:48
  • To check whether an email is valid (i.e. correct syntax plus a valid domain, verified via DNS), you can use [pyIsEmail](https://github.com/michaelherold/pyIsEmail). The answers to [Extract email sub-strings from large document](https://stackoverflow.com/questions/17681670/extract-email-sub-strings-from-large-document) show how to obtain potential email addresses, which can then be validated with pyIsEmail. – DarrylG Oct 12 '20 at 07:59

2 Answers


In regex, `[gmail|hotmail|outlook]+` is a character class, so it means 'match one or more of any of these characters: g, m, a, i, l, h, o, t, k, u, |'. What you need is a non-capturing group `(?:...)`, like this: `r'[a-zA-Z0-9_.-]+[^!#$%^&*()]@(?:gmail|hotmail|outlook)\.com'`. The same applies to `[.com]+`, which matches any run of the characters ., c, o, m rather than the literal text .com. Outside a character class, an unescaped . matches any character, so you need to escape it as `\.`.
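To make the difference concrete, here is a minimal sketch that runs both the question's pattern and the corrected one against the sample text from the question. The sample is inlined as a string rather than read from a file, and the variable names are only illustrative.

```python
import re

# Sample text copied from the question's file, inlined for a self-contained run.
text = """sentences in it  pythonprac@dummy.com
test@gmail.com
test1@hotmail.com
The post A rough draft for a 5 paragraph essay and then a final draft. appeared first on EssayBishop.
hello@gm.com
"""

# Question's pattern: the square brackets are character classes, not alternatives,
# so any run of the letters in "gmail|hotmail|outlook" is accepted as a domain name.
broken_re = r'[a-zA-Z0-9_.-]+[^!#$%^&*()]@[gmail|hotmail|outlook]+[.com]+'

# Corrected pattern: a non-capturing group of alternatives and an escaped dot.
fixed_re = r'[a-zA-Z0-9_.-]+[^!#$%^&*()]@(?:gmail|hotmail|outlook)\.com'

broken_matches = re.findall(broken_re, text)
fixed_matches = re.findall(fixed_re, text)

print(broken_matches)  # also contains hello@gm.com, since g and m are in the class
print(fixed_matches)   # ['test@gmail.com', 'test1@hotmail.com']
```

The group has to be non-capturing (or absent) here because `re.findall` returns only the captured group when a pattern contains exactly one capturing group, which would yield just the domain names instead of the full addresses.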

mananony

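Try this regex. Note that it matches any email-like string, so the results still have to be filtered by domain afterwards:

import re
with open('emails.txt') as f:
    text = f.read()
emails = re.findall(r'[\w.-]+@[\w.-]+', text)  # every email-like candidate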
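A broad pattern like this one returns every email-like string, including pythonprac@dummy.com and hello@gm.com from the question's sample. One possible way to get the required output is to extract candidates first and then keep only those whose domain is in an allow-list. The sketch below assumes that approach; `ALLOWED_DOMAINS` and `extract_allowed` are hypothetical names introduced for illustration, not part of the original answer.

```python
import re

# Domains the question treats as valid (hypothetical constant name).
ALLOWED_DOMAINS = {"gmail.com", "outlook.com", "hotmail.com"}


def extract_allowed(text):
    """Return email-like strings in text whose domain is in ALLOWED_DOMAINS."""
    results = []
    for candidate in re.findall(r'[\w.-]+@[\w.-]+', text):
        # The broad pattern can swallow trailing sentence punctuation; trim it.
        address = candidate.rstrip('.-')
        # Everything after the last @ is the domain; compare case-insensitively.
        domain = address.rsplit('@', 1)[1].lower()
        if domain in ALLOWED_DOMAINS:
            results.append(address)
    return results


sample = """sentences in it  pythonprac@dummy.com
test@gmail.com
test1@hotmail.com
hello@gm.com
"""
print(extract_allowed(sample))  # ['test@gmail.com', 'test1@hotmail.com']
```

Keeping the domain list outside the regex makes it easy to add or remove domains without touching the pattern, at the cost of a second pass over the candidates.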