Extract e-mail addresses from .txt files in Python

Question

I would like to parse out e-mail addresses from several text files in Python. In a first attempt, I tried to get the following element that includes an e-mail address from a list of strings ('2To whom correspondence should be addressed. E-mail: joachim+pnas@uci.edu.\n').

When I try to find the list element that includes the e-mail address via i.find("@") == 0 it does not give me the content[i]. Am I misunderstanding the .find() function? Is there a better way to do this?

from os import listdir

TextFileList = []
PathInput = "C:/Users/p282705/Desktop/PythonProjects/ExtractingEmailList/text/"

# Count the number of different files you have!
for filename in listdir(PathInput):
    if filename.endswith(".txt"):  # In case you accidentally put other files in directory
        TextFileList.append(filename)

for i in TextFileList:
    file = open(PathInput + i, 'r')
    content = file.readlines()
    file.close()

for i in content:
    if i.find("@") == 0:
        print(i)

The find function returns the index of the found substring. If you just want to check whether the string contains an "@", test i.find("@") != -1, since -1 means that there is no "@" in the string. – APorter1031, Jan 08 '18 at 16:13

Matheus Portela · Answer 1 · 2018-01-08T16:37:13.880

The standard way of checking whether a string contains a character, in Python, is using the in operator. In your case, that would be:

for i in content:
    if "@" in i:
        print(i)

The find method that you were using returns the position at which the @ character is located, counting from 0, as described in the official Python documentation.

For instance, for the string abc@google.com it returns 3. If the character is not found, it returns -1. The equivalent code would be:

for i in content:
    if i.find("@") != -1:
        print(i)

However, this is considered unpythonic; the in operator is preferred.
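
To see why the original == 0 test printed nothing, here is a minimal sketch using the sample line from the question. The variable names are only illustrative.

```python
# Sample line taken from the question.
line = "2To whom correspondence should be addressed. E-mail: joachim+pnas@uci.edu.\n"

position = line.find("@")        # index of the first "@", or -1 if absent
starts_with_at = position == 0   # True only if the line begins with "@"
contains_at = "@" in line        # True if "@" occurs anywhere in the line

print(position, starts_with_at, contains_at)
```

The "@" sits in the middle of the line, so the comparison with 0 is False even though the line does contain an address.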

score 1 · Answer 2 · answered Jan 08 '18 at 16:23

find returns the index at which the substring you are searching for is found. Comparing that index with 0 is not what you want here, since it only matches lines that begin with "@".

You would be better off using a regular expression (the re module) to search for an occurrence of @. You may also run into a situation where there is more than one email address per line (I don't know your input data, so I can't say).

Something along these lines would benefit you:

import re

for i in content:
    findEmail = re.search(r'[\w\.-]+@[\w\.-]+', i)
    if findEmail:
        print(findEmail.group(0))

You would need to adjust this pattern for valid email addresses. For instance, + is allowed in the part before the @, but the pattern as written does not match it, so the match would start after the +.
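
If several addresses can appear on one line, re.findall returns all of them rather than only the first. The sketch below is one possible adjustment, not a full validator. The pattern is an assumption that additionally allows + before the @, extract_emails is a hypothetical helper name, and the second address in the sample line is made up for illustration.

```python
import re

# Assumed, deliberately loose pattern: word characters, ".", "+" and "-"
# before the "@"; word characters, "." and "-" after it.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w.-]+")


def extract_emails(line):
    """Return every address-like substring found in line."""
    # rstrip(".") drops a sentence-ending period captured by the pattern.
    return [match.rstrip(".") for match in EMAIL_PATTERN.findall(line)]


# Illustrative input: the first address is from the question, the second is invented.
line = "E-mail: joachim+pnas@uci.edu. Alternate: jane.doe@example.org\n"
print(extract_emails(line))  # ['joachim+pnas@uci.edu', 'jane.doe@example.org']
```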

Jack Nicholson

Thank you, problem solved! Do you have an idea how to check whether an e-mail address is invalid? – PROgrammer Jan 09 '18 at 10:59
Just to note that this solution will only work for 1 email per string, so if you have multiple emails it will only return the first one. Checking for invalid emails is a problem as you can never be sure whether or not it's a typo by the user, an email bounces back or something else. Really there's no point in checking as there are so many factors outwith your control. Unless you have a specific set of rules you want to follow, I wouldn't bother checking... Best to ensure there is an "@" followed by at least one "." – Jack Nicholson Jan 09 '18 at 11:02
This may help you to update the regex: https://stackoverflow.com/questions/2049502/what-characters-are-allowed-in-an-email-address – Jack Nicholson Jan 09 '18 at 11:05
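
The loose rule suggested in the comment above (an "@" followed by at least one ".") could be sketched as follows. looks_like_email is a hypothetical helper name, and the check is intentionally permissive. It does not verify that an address exists or follows the full address syntax.

```python
def looks_like_email(candidate):
    """Loose check: one "@", text before it, and a "." inside the part after it."""
    local, at, domain = candidate.partition("@")
    if not at or not local or "@" in domain:
        return False
    # Require text on both sides of the last "." in the domain.
    head, _, tail = domain.rpartition(".")
    return bool(head) and bool(tail)


print(looks_like_email("joachim+pnas@uci.edu"))  # True
print(looks_like_email("joachim@uci"))           # False
```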

score 0 · Answer 3 · answered Jan 08 '18 at 16:13

The find function in Python returns the index of the character in the string. Maybe you can try this?

words = i.split(' ')  # split the line into words (avoid naming this "list", which shadows the built-in)
for x in words:       # search each word for the @ character
    if x.find("@") != -1:
        print(x)
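
A self-contained variant of the same idea, run on the sample line from the question, might look like this. It is only a sketch: the set of punctuation characters to strip is an assumption, and the in operator is swapped in for find.

```python
# Sample line taken from the question.
line = "2To whom correspondence should be addressed. E-mail: joachim+pnas@uci.edu.\n"

# split() without arguments splits on any whitespace and discards the newline.
# strip() then removes surrounding punctuation such as the sentence-ending period.
candidates = [word.strip(".,;:()<>") for word in line.split() if "@" in word]

print(candidates)  # ['joachim+pnas@uci.edu']
```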
