Extract email sub-strings from large document

Question

I have a very large .txt file with hundreds of thousands of email addresses scattered throughout. They all take the format:

...<name@domain.com>...

What is the best way to have Python cycle through the entire .txt file looking for all instances of a certain @domain string, grab the entire address within the <...>, and add it to a list? The trouble I have is with the variable length of the different addresses.

0x90 · Accepted Answer · 2023-02-16T23:52:16.563

152

This code extracts an email address from a string. Use it while reading the file line by line.

>>> import re
>>> line = "should we use regex more often? let me know at  jdsk@bob.com.lol"
>>> match = re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', line)
>>> match.group(0)
'jdsk@bob.com.lol'

If you have several email addresses use findall:

>>> line = "should we use regex more often? let me know at  jdsk@bob.com.lol or popop@coco.com"
>>> match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', line)
>>> match
['jdsk@bob.com.lol', 'popop@coco.com']
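For a very large file, the same pattern can be applied one line at a time so the whole file never has to be held in memory. This is a minimal sketch, not part of the original answer: the helper name `extract_emails`, the sample data, and the file name in the comment are all made up for illustration, and an in-memory `io.StringIO` stands in for the opened file.

```python
import io
import re

# Same pattern as in the answer, compiled once and reused for every line.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def extract_emails(lines):
    """Collect every address found in an iterable of text lines."""
    found = []
    for line in lines:
        found.extend(EMAIL_RE.findall(line))
    return found

# io.StringIO stands in for open("big_file.txt") (hypothetical file name);
# a real file object can be passed the same way because it also yields lines lazily.
sample = io.StringIO(
    "first <jdsk@bob.com> line\n"
    "no address here\n"
    "<popop@coco.com> and <a.b@c.org>\n"
)
emails = extract_emails(sample)
print(emails)
```

Compiling the pattern once outside the loop keeps the per-line work down to a single `findall` call.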

The regex above should find most real-world email addresses. If you need to be fully aligned with RFC 5322, check which addresses the specification actually allows, to avoid bugs in finding email addresses correctly.

Edit: as suggested in a comment by @kostek: in the string "Contact us at support@example.com." my regex returns support@example.com. (with the dot at the end). To avoid this, use [\w\.,]+@[\w\.,]+\.\w+

Edit II: another wonderful improvement was mentioned in the comments: [\w\.-]+@[\w\.-]+\.\w+ which will capture example@do-main.com as well.

Edit III: Added further improvements as discussed in the comments: "In addition to allowing + in the beginning of the address, this also ensures that there is at least one period in the domain. It allows multiple segments of domain like abc.co.uk as well, and does NOT match bad@ss :). Finally, you don't actually need to escape periods within a character class, so it doesn't do that."
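The trade-off between these edits can be checked directly. The sketch below is illustrative only: the sample sentence is the one from the first edit, and the variable names are made up here. The pattern `[\w.+-]+@[\w-]+\.[\w.-]+` still picks up a sentence-ending period because its last character class allows dots, so one possible remedy is to strip trailing periods afterwards.

```python
import re

text = "Contact us at support@example.com."

# Pattern from Edit II: the match must end in \w+, so the final period is left out.
edit2 = re.findall(r'[\w.-]+@[\w.-]+\.\w+', text)

# Pattern used at the top of the answer: the last class allows '.',
# so the sentence-ending period is captured as well.
edit3 = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text)

# Stripping trailing periods afterwards is one possible clean-up step.
cleaned = [m.rstrip('.') for m in edit3]

print(edit2)
print(edit3)
print(cleaned)
```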

Update 2023: It seems Stack Abuse has compiled a post based on the popular SO answer mentioned above.

import re

regex = re.compile(r"([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\"([]!#-[^-~ \t]|(\\[\t -~]))+\")@([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|\[[\t -Z^-~]*])")

def isValid(email):
    if re.fullmatch(regex, email):
        print("Valid email")
    else:
        print("Invalid email")

isValid("name.surname@gmail.com")
isValid("anonymous123@yahoo.co.uk")
isValid("anonymous123@...uk")
isValid("...@domain.us")

edited Feb 16 '23 at 23:52

answered Jul 16 '13 at 16:20

0x90


1

xyz+44@gmail.com doesn't get caught. – bad_keypoints Jun 13 '16 at 11:51
5

according to this regex 'bad@ss' is a valid email address ;) – nischi Jan 12 '17 at 13:51
1

In string `Contact us at support@example.com.` this regex returns `support@example.com.` (with dot at the end). To avoid this, use `[\w\.,]+@[\w\.,]+\.\w+`. – kostek Feb 12 '17 at 18:39
7

`[\w\.,]+@[\w\.,]+\.\w+` does not match `example@do-main.com` which is a valid email address. So it should be `[\w\.-]+@[\w\.-]+\.\w+` – Hieu Apr 01 '17 at 09:20
1

@kostek with your regex `Contact us at support@example.com.Or try +33600000000` extracts `support@example.com.Or` – J. Doe Aug 31 '17 at 08:58
1

@J.Doe - That's true, but that's also the expected behavior. `support@example.com.Or` is technically a valid email address, and it's not properly delimited, so it's returning what it's supposed to. – Pikamander2 Oct 03 '19 at 02:44
A `+` should be added to the first character class in the regex because email addresses like `hikingfan+friends@gmail.com` are valid and actually somewhat widely used. The person who owns `hikingfan@gmail.com` can use arbitrary text after `hikingfan+` for various personal purposes. – Stephen Jul 08 '21 at 16:49
1

Here is my improved regex: `re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')`. In addition to allowing `+` in the beginning of the address, this also ensures that there is at least one period in the domain. It allows multiple segments of domain like `abc.co.uk` as well, and does NOT match `bad@ss` :). Finally, you don't actually need to escape periods within a character class, so it doesn't do that. I haven't spent a lot of time thinking about this, but I do believe it's an improvement. It won't be perfect of course. – Stephen Jul 08 '21 at 17:13

score 11 · Answer 2 · edited Oct 12 '20 at 08:36

You can also use the following to find all the email addresses in a text and print them in an array or each email on a separate line.

import re
line = "why people don't know what regex are? let me know asdfal2@als.com, Users1@gmail.de " \
       "Dariush@dasd-asasdsa.com.lo,Dariush.lastName@someDomain.com"
match = re.findall(r'[\w\.-]+@[\w\.-]+', line)
for i in match:
    print(i)

The result of findall is already a list, so to see it as a list just print match:

# this will print the list
print(match)

david_adler · Answer 3 · 2023-02-07T16:02:03.350

11

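import re

# `text` is the string to search; it is assumed to be defined elsewhere.
rgx = r'(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]?\(?[ ]?(at|AT)[ ]?\)?[ ]?)(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])'
matches = re.findall(rgx, text)
# The pattern has several capturing groups, so findall returns one tuple per match;
# the first element of each tuple is the full address.
emails = [m[0] for m in matches]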

Forgive me lord for having a go at this infamous regex. The regex works for a decent portion of email addresses shown below. I mostly used this as my basis for the valid chars in an email address.

Feel free to play around with it here

I also made a variation where the regex captures emails like name at example.com

(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]\(?[ ]?(at|AT)[ ]?\)?[ ])(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])

edited Feb 07 '23 at 16:02

answered Oct 02 '18 at 12:02

david_adler


4

I tried a bunch of regexes on different sites and this is the first one that actually just worked, kudos. – rosstex Jan 02 '20 at 04:04
3

Unfortunately, this expression can result in catastrophic backtracking: https://regex101.com/r/AwW89g/1 – Pikamander2 May 15 '20 at 06:22

score 4 · Answer 4 · answered Jul 16 '13 at 16:36

If you're looking for a specific domain:

>>> import re
>>> text = "this is an email la@test.com, it will be matched, x@y.com will not, and test@test.com will"
>>> # replace test\.com with the domain you're looking for, adding a backslash before periods
>>> match = re.findall(r'[\w.+%-]+@test\.com', text)  # '-' goes last in the class; [\w-\.] is rejected as a bad character range
>>> match
['la@test.com', 'test@test.com']
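Since the addresses in the question are always wrapped in angle brackets, the brackets themselves can anchor the match. A rough sketch, assuming a hypothetical helper `emails_for_domain` that is not from the original answer; `re.escape` handles the periods in the domain so they do not have to be escaped by hand.

```python
import re

def emails_for_domain(text, domain):
    """Return addresses of the form <local@domain> found in text."""
    # [^<>@\s]+ is the local part: anything except brackets, '@', or whitespace.
    pattern = re.compile(r'<([^<>@\s]+@' + re.escape(domain) + r')>')
    return pattern.findall(text)

text = "a <la@test.com> b <x@y.com> c <test@test.com> d <no@nottest.com>"
result = emails_for_domain(text, "test.com")
print(result)
```

Because the closing `>` must follow the domain directly, an address such as `<no@nottest.com>` or `<a@test.com.au>` is not picked up when searching for `test.com`.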

Palash Jhamb · Answer 5 · 2020-01-17T10:59:05.430

1

import re

reg_pat = r'\S+@\S+\.\S+'

test_text = 'xyz.byc@cfg-jj.com    ir_er@cu.co.kl   uiufubvcbuw bvkw  ko@com    m@urice'   

emails = re.findall(reg_pat ,test_text,re.IGNORECASE)
print(emails)

Output:

['xyz.byc@cfg-jj.com', 'ir_er@cu.co.kl']

edited Jan 17 '20 at 10:59

answered Jan 17 '20 at 10:53

Palash Jhamb


score 0 · Answer 6 · answered Jul 26 '19 at 07:15

import re
mess = '''Jawadahmed@gmail.com Ahmed@gmail.com
            abc@gmail'''
email = re.compile(r'([\w\.-]+@gmail\.com)')  # escape the dot so it only matches a literal "."
result = email.findall(mess)

# findall returns a list (empty when nothing matches), never None
if result:
    print(result)

The code above extracts only the Gmail addresses.

score 0 · Answer 7 · answered Jan 29 '20 at 06:59

0

You can add \b at the end of the pattern to mark where the email address ends, so that a trailing period or hyphen is not captured.

The regex

[\w\.\-]+@[\w\-\.]+\b
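To see what the trailing `\b` changes, here is a small comparison. It is only a sketch; the sample sentence and variable names are assumptions made for illustration.

```python
import re

text = "Reach the admin at admin@example.org."

# Without \b the greedy domain class also swallows the sentence-ending period.
without_boundary = re.findall(r'[\w\.\-]+@[\w\-\.]+', text)

# With \b the match has to end at a word boundary, i.e. right after "org".
with_boundary = re.findall(r'[\w\.\-]+@[\w\-\.]+\b', text)

print(without_boundary)
print(with_boundary)
```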

answered Jan 29 '20 at 06:59

Rishang


score 0 · Answer 8 · answered Nov 18 '20 at 13:59

Example: if the mail ID contains only lowercase letters a-z, digits 0-9, and at most one "." or "_" separator, the regex below can be used:

>>> import re
>>> str1 = "abcdef_12345@gmail.com"
>>> regex1 = r"^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$"
>>> re_com = re.compile(regex1)
>>> re_match = re_com.search(str1)
>>> re_match
<_sre.SRE_Match object at 0x1063c9ac0>
>>> re_match.group(0)
'abcdef_12345@gmail.com'

score 0 · Answer 9 · answered Jan 30 '21 at 09:33

0

import re

content = ' abcdabcd jcopelan@nyx.cs.du.edu  afgh 65882@mimsy.umd.edu  qwertyuiop mangoe@cs.umd'

match_objects = re.findall(r'\w+@\w+[\.\w+]+', content)

answered Jan 30 '21 at 09:33

Sandeep Maurya


Farzad Amirjavid · Answer 10 · 2021-07-03T15:07:53.940

# \b[\w|\.]+ ---> means: starting at a word boundary, one or more letters, digits, underscores, dots, or literal "|" characters (inside [...] the | is not alternation).

import re

marks = '''

!()[]{};?#$%:'"\,/^&é*

'''

text = 'Hello from priyankv@gmail.com to python@gmail.com, datascience@@gmail.com and machinelearning@@yahoo..com wrong email address: farzad@google.commmm'
# list of sequences of characters:
text_pieces = text.split()
pattern = r'\b[a-zA-Z]{1}[\w|\.]*@[\w|\.]+\.[a-zA-Z]{2,3}$'
for p in text_pieces:
  for x in marks:
    p = p.replace(x, "") 
  if len(re.findall(pattern, p)) > 0:
    print(re.findall(pattern, p))

score 0 · Answer 11 · answered Sep 17 '22 at 12:18

Another way is to split the pattern into three groups and take group(0), which is the whole match. See below:

import re

emails=[]
for line in email: # email is the text file where some emails exist. 
    e=re.search(r'([.\w\d-]+)(@)([.\w\d-]+)',line) # 3 different groups are composed. 
    if e:
        emails.append(e.group(0))

print(emails)
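How the three groups relate to `group(0)` can be seen on a single string. This is a small sketch with a made-up sample sentence; the redundant `\d` is dropped from the character classes because `\w` already covers digits.

```python
import re

m = re.search(r'([.\w-]+)(@)([.\w-]+)', "reply to bob.smith@example.com today")

whole = m.group(0)    # the entire match
local = m.group(1)    # first parenthesised group: the text before '@'
at_sign = m.group(2)  # second group: the '@' itself
domain = m.group(3)   # third group: the text after '@'

print(whole, local, at_sign, domain)
```

If only the whole address is needed, the parentheses can be left out entirely, since `group(0)` does not depend on them.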

score -1 · Answer 12 · answered Jan 12 '17 at 15:00

Here's another approach for this specific problem, with a regex from emailregex.com:

import re

text = "blabla <hello@world.com>><123@123.at> <huhu@fake> bla bla <myname@some-domain.pt>"

# 1. find all potential email addresses (note: < inside <> is a problem)
matches = re.findall(r'<\S+?>', text)  # ['<hello@world.com>', '<123@123.at>', '<huhu@fake>', '<myname@some-domain.pt>']

# 2. apply email regex pattern to string inside <>
emails = [x[1:-1] for x in matches if re.match(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)", x[1:-1])]
print(emails)   # ['hello@world.com', '123@123.at', 'myname@some-domain.pt']

score -1 · Answer 13 · edited Nov 10 '18 at 12:59

-1

import re
txt = 'hello from absc@gmail.com to par1@yahoo.com about the meeting @2PM'
email = re.findall(r'\S+@\S+', txt)
print(email)

Printed output:

['absc@gmail.com', 'par1@yahoo.com']

edited Nov 10 '18 at 12:59

MBT


answered Nov 10 '18 at 10:05

Ayoub EL MAJJODI


score -1 · Answer 14 · answered Apr 17 '19 at 12:00

-1

import re
with open("file_name",'r') as f:
    s = f.read()
    result = re.findall(r'\S+@\S+',s)
    for r in result:
        print(r)

answered Apr 17 '19 at 12:00

Laksh Jadhwani


This code works for getting the email-ids from a file – Laksh Jadhwani Apr 17 '19 at 12:01
1

... as well as, for example, `@@@`. – tripleee Oct 12 '20 at 08:38
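The point of that last comment is easy to confirm. The sketch below uses a made-up sample string and compares the loose `\S+@\S+` pattern with the stricter pattern from the accepted answer.

```python
import re

junk = "see @@@ or a@b for details"

# \S matches '@' too, so a run of '@' characters satisfies \S+@\S+.
loose = re.findall(r'\S+@\S+', junk)

# The stricter pattern needs word characters around '@' and a dot in the domain.
strict = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', junk)

print(loose)
print(strict)
```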
