I am trying to extract a list of unique email addresses from a .txt file (https://www.py4e.com/code3/mbox.txt) that contains multiple email messages. I can pull the lines containing email addresses by narrowing my search to the 'From:' and 'To:' lines with the program below:
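import re

countFromEmail = 0  # number of 'From:' lines matched so far
in_file = open('dummy_text_file.txt')
for line in in_file:
    # Match 'From:' lines that contain an address with a dotted domain
    if re.findall(r'^From:.+@([^.]*)\.', line):
        countFromEmail = countFromEmail + 1
        print(line)
    # Match 'To:' lines the same way
    if re.findall(r'^To:.+@([^.]*)\.', line):
        print(line)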
However, this does not give me a unique list, as several of the email addresses are repeated. Furthermore, what ends up being printed looks like this:
To: java-user@lucene.apache.org
From: Adrien Grand < jpountz@gmail.com >
I want to list only the actual email addresses, without the 'To:' or 'From:' prefix or the angle brackets (<>).
I'm not well versed in Python, but my original idea was to extract the bare email addresses, store them somewhere, and use a for loop to add them to a list.
Any help or pointers in the right direction would be appreciated.
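For reference, here is a minimal sketch of the approach described above (extract the bare address, store it, and build a list without repeats). It is only an illustration under some assumptions: the name `unique_emails` and the pattern `EMAIL_RE` are hypothetical, the pattern is a simplification rather than a full email-address matcher, and the sample lines are the two shown earlier, with one repeated.

```python
import re

# Hypothetical, simplified pattern: local part, '@', then a dotted domain.
# It will not cover every valid address form.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

def unique_emails(lines):
    """Return addresses found on 'From:' and 'To:' lines, without repeats."""
    seen = set()    # fast membership test for addresses already collected
    ordered = []    # keeps the order of first appearance
    for line in lines:
        if line.startswith(('From:', 'To:')):
            for address in EMAIL_RE.findall(line):
                if address not in seen:
                    seen.add(address)
                    ordered.append(address)
    return ordered

# Sample input taken from the printed output above, with one line repeated.
sample = [
    'To: java-user@lucene.apache.org\n',
    'From: Adrien Grand < jpountz@gmail.com >\n',
    'From: Adrien Grand < jpountz@gmail.com >\n',
]
print(unique_emails(sample))
```

Because a file object yields its lines one at a time, the same function should also accept an open file in place of the `sample` list. The set is what removes the repeats; the separate list is only there to preserve the order in which addresses first appear.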