How to scrape valid emails from a file using Regex in Python?

Question

I have a file with some random text, including some email addresses. I am trying to write code that finds the emails with valid domains (gmail.com, outlook.com, hotmail.com) using regex.

Here is the code I've tried so far:

import re

email_re = r'[a-zA-Z0-9_.-]+[^!#$%^&*()]@[gmail|hotmail|outlook]+[.com]+'
with open('emails.txt') as f:
    read = f.read()
email_data = re.findall(email_re, read)
print(email_data)

My emails.txt file:

sentences in it  pythonprac@dummy.com
test@gmail.com
test1@hotmail.com
The post A rough draft for a 5 paragraph essay and then a final draft. appeared first on EssayBishop.
hello@gm.com

Required Output: test@gmail.com, test1@hotmail.com

Does this answer your question? [Extract email sub-strings from large document](https://stackoverflow.com/questions/17681670/extract-email-sub-strings-from-large-document) — Bipul singh kashyap, Oct 12 '20 at 07:44
That really depends a lot on how accurate you want to be, because email addresses can have wildly different formats (all correct, according to specifications), see: https://emailregex.com/ — ChatterOne, Oct 12 '20 at 07:45
Maybe read the [Stack Overflow `regex` tag info page](/tags/regex/info) before posting questions; it covers several common beginner mistakes. — tripleee, Oct 12 '20 at 07:48
To check whether an email is valid (i.e. correct syntax plus a valid domain, checked via DNS) you can use [pyIsEmail](https://github.com/michaelherold/pyIsEmail). The answers to [Extract email sub-strings from large document](https://stackoverflow.com/questions/17681670/extract-email-sub-strings-from-large-document) show how to obtain potential email addresses, which can then be validated with pyIsEmail. — DarrylG, Oct 12 '20 at 07:59

score 0 · Accepted Answer · answered Oct 12 '20 at 07:46

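In regex, square brackets define a character class, so [gmail|hotmail|outlook]+ essentially means 'match one or more of any of these characters: g, m, a, i, l, h, o, t, k, u, |'. Likewise, [.com]+ matches one or more of the characters ., c, o, m. What you need is a non-capturing group (?:...) around the alternation, like this:

r'[a-zA-Z0-9_.-]+[^!#$%^&*()]@(?:gmail|hotmail|outlook)\.com'

Outside a character class, an unescaped . matches any character, so the dot before com has to be escaped as \. to match a literal dot.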
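As a rough check of the corrected pattern, the sketch below runs it against the sample lines from the question. The `SAMPLE` string stands in for the contents of the file, and the `find_allowed_emails` helper name is an illustrative assumption, not part of the original code.

```python
import re

# (?:...) is a non-capturing group, so | separates whole words;
# \. matches a literal dot rather than any character.
EMAIL_RE = re.compile(
    r'[a-zA-Z0-9_.-]+[^!#$%^&*()]@(?:gmail|hotmail|outlook)\.com'
)

# Sample text copied from the question (stand-in for reading the file).
SAMPLE = """sentences in it  pythonprac@dummy.com
test@gmail.com
test1@hotmail.com
hello@gm.com
"""


def find_allowed_emails(text):
    """Return every substring of text that matches EMAIL_RE."""
    return EMAIL_RE.findall(text)


print(find_allowed_emails(SAMPLE))
# → ['test@gmail.com', 'test1@hotmail.com']
```

Because the group is non-capturing, `findall` returns the whole match rather than just the domain name inside the parentheses.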

score 0 · Answer 2 · answered Oct 12 '20 at 07:47

Try this regex

import re
with open('emails.txt') as f:
    text = f.read()
emails = re.findall(r'[\w.-]+@[\w.-]+', text)  # matches any domain, not only the three requested
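This broad pattern matches addresses at any domain, so on its own it would also return pythonprac@dummy.com and hello@gm.com. One possible way to get the required output is to filter the candidates afterwards. The sketch below assumes an `ALLOWED_DOMAINS` set and a `filter_by_domain` helper, both hypothetical names introduced here for illustration.

```python
import re

# Domains the question treats as valid (assumed to be the full list).
ALLOWED_DOMAINS = {"gmail.com", "outlook.com", "hotmail.com"}

# Broad pattern from this answer: anything that looks like user@domain.
CANDIDATE_RE = re.compile(r'[\w.-]+@[\w.-]+')

# Sample text copied from the question (stand-in for reading the file).
SAMPLE_TEXT = """sentences in it  pythonprac@dummy.com
test@gmail.com
test1@hotmail.com
hello@gm.com
"""


def filter_by_domain(text, allowed=ALLOWED_DOMAINS):
    """Return candidate addresses whose domain is in the allowed set."""
    results = []
    for candidate in CANDIDATE_RE.findall(text):
        # The pattern can swallow a sentence-ending dot; strip it.
        candidate = candidate.rstrip('.')
        # Compare domains case-insensitively.
        domain = candidate.rsplit('@', 1)[1].lower()
        if domain in allowed:
            results.append(candidate)
    return results


print(filter_by_domain(SAMPLE_TEXT))
# → ['test@gmail.com', 'test1@hotmail.com']
```

Keeping the allowed domains in a set rather than in the regex makes the list easier to extend, at the cost of a second pass over the matches.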

Bipul singh kashyap
