How to extract list of unique email addresses from a file in Python

Question

I am trying to extract a list of unique email addresses from a .txt file (https://www.py4e.com/code3/mbox.txt) that contains multiple email messages. I am able to pull a list of email addresses by narrowing my search to the 'From:' and 'To:' lines with the below program:

import re
in_file = open('dummy_text_file.txt')
countFromEmail = 0  # the counter must exist before it is incremented
for line in in_file:
    if re.findall(r'^From:.+@([^\.]*)\.', line):
        countFromEmail = countFromEmail + 1
        print(line)
    if re.findall(r'^To:.+@([^\.]*)\.', line):
        print(line)

However, this does not give me a unique list, because many of the email addresses are repeated. Furthermore, what ends up being printed looks like this:

To: java-user@lucene.apache.org

From: Adrien Grand < jpountz@gmail.com >

I am looking to only list the actual email address without the 'to', 'from', or the angle brackets (<>).

I'm not well versed in Python, but my original approach was to extract the bare email addresses, perhaps store them somewhere, and then use a for loop to add them to a list.

Any help or pointers in the right direction would be appreciated.

To the best of my knowledge there is no built-in method for removing duplicates. I would recommend checking out the following links to a solution to that specific problem: https://www.peterbe.com/plog/uniqifiers-benchmark | https://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-whilst-preserving-order — K-Log, Oct 13 '18 at 21:28

score 0 · Answer 1 · answered Oct 13 '18 at 21:34

For getting a list of unique emails I would check out the following two articles:

https://www.peterbe.com/plog/uniqifiers-benchmark

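How do you remove duplicates from a list whilst preserving order? (https://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-whilst-preserving-order)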

For parsing Adrien Grand < jpountz@gmail.com > into a different format, the following link should contain all the information you need.

https://docs.python.org/3.7/library/email.parser.html#module-email.parser

Unfortunately, I don't have the time to write you up an example, but I hope this helps.
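As a rough illustration of the parsing step, the sketch below uses email.utils.parseaddr from the standard library (a close relative of the email.parser module linked above) to pull the bare address out of a single header line. The helper name bare_address is hypothetical, and the sketch assumes each line holds exactly one address.

```python
from email.utils import parseaddr


def bare_address(header_line):
    """Return only the address from a 'From:' or 'To:' line ('' if none is found).

    bare_address is a hypothetical helper name used for illustration.
    """
    # Keep the header value, i.e. everything after the first colon.
    _, _, value = header_line.partition(":")
    # parseaddr returns a (display_name, address) pair; only the address is needed.
    _, address = parseaddr(value.strip())
    return address


print(bare_address("From: Adrien Grand < jpountz@gmail.com >"))
print(bare_address("To: java-user@lucene.apache.org"))
```

Lines that carry several addresses would need email.utils.getaddresses instead, which returns a list of such pairs.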

score 0 · Answer 2 · answered Oct 13 '18 at 21:49

The simplest way to do it is with set().

Sets contain only unique values.

array = [1, 2, 3, 4, 5, 5, 5]
unique_array = set(array)
print(unique_array)  # {1, 2, 3, 4, 5}
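One caveat: a set does not remember the order in which values were added. If the order of first appearance matters, a possible alternative (assuming Python 3.7 or later, where dictionaries keep insertion order) is dict.fromkeys. The sample addresses below are made up for illustration.

```python
# Made-up sample data with one repeated address.
addresses = ["b@example.com", "a@example.com", "b@example.com"]

# Dictionary keys are unique and keep insertion order (Python 3.7+),
# so each address survives once, in order of first appearance.
unique_in_order = list(dict.fromkeys(addresses))

print(unique_in_order)  # ['b@example.com', 'a@example.com']
```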

ramazan polat

Mehedi H. · Answer 3 · 2021-07-23T06:28:12.627

import re

countFromEmail = 0
unique_emails = set()  # a set keeps only one copy of each value
with open('mbox.txt') as in_file:  # the file is closed automatically
    for line in in_file:
        if re.findall(r'^From:.+@([^\.]*)\.', line):
            countFromEmail += 1
            line = line.replace("From:", "")  # remove the header name
            line = line.strip()  # then trim the surrounding whitespace
            unique_emails.add(line)  # add to the set
        elif re.findall(r'^To:.+@([^\.]*)\.', line):
            line = line.replace("To:", "")  # remove the header name
            line = line.strip()  # then trim the surrounding whitespace
            unique_emails.add(line)  # add to the set
for email in unique_emails:
    print(email)  # print is a function in Python 3

You can achieve this result in many different ways. Using a set is one of them, because the elements of a set are unique: any duplicate is discarded automatically on insertion.

You can read more about sets (unordered collections of unique elements) in Python.

I have edited and commented your code for you. Hope this helps. Cheers! :)

-Mehedi H.
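The code above stores whole header values, so a line such as "From: Adrien Grand < jpountz@gmail.com >" still keeps the name and the angle brackets. A minimal sketch of one way to keep only the bare addresses follows. The ADDRESS pattern and the unique_addresses helper are assumptions made for illustration: the pattern is a simple approximation, not a full RFC 5322 parser.

```python
import re

# Hypothetical pattern: a rough approximation of an email address.
ADDRESS = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def unique_addresses(lines):
    """Collect the distinct addresses found on 'From:' and 'To:' lines."""
    found = set()
    for line in lines:
        # Only header lines of interest are searched.
        if line.startswith(("From:", "To:")):
            # findall returns every full match; the set drops duplicates.
            found.update(ADDRESS.findall(line))
    return found


# Sample lines modeled on the output shown in the question.
sample = [
    "From: Adrien Grand < jpountz@gmail.com >\n",
    "To: java-user@lucene.apache.org\n",
    "From: Adrien Grand < jpountz@gmail.com >\n",
    "Subject: not an address line\n",
]

for address in sorted(unique_addresses(sample)):
    print(address)
```

Because a file object yields its lines when iterated, the same function should work on the real data with something like: with open('mbox.txt') as in_file: unique_addresses(in_file).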
