I'm doing an exercise in which I have to write a program that takes text copied to the clipboard, parses it, and returns a list (in the non-Python sense) of the email addresses it contains.
The source file for said input is a sample public domain PDF that has the following layout:
It looks simple enough, except when I copy/paste that input normally (i.e. without using my program), I get the following output:
Kasey Mcbridemcbrid17@gmail.com939-537-1879Long Cohencohe1696@yahoo.com905-523-5311Hunter Waltonhwalton3@hotmail.com975-675-8521Jacques Deanjacquesd@att.net515-420-4722Nicky Clevelandncleveland88@mac.com573-286-5790
You see where the problem lies: the surname is stuck to the beginning of the email address, and thus my program wouldn't be able to parse the addresses correctly.
Is there a way, regex or otherwise, to separate these during parsing, or is there nothing to be done short of fixing it by hand or reformatting the file?
So far, my regex looks like this:
import re

email_regex = re.compile(r'''
    [a-zA-Z0-9_.+]+             # name part
    @                           # @
    [a-zA-Z0-9_.+]+\.\w{2,3}    # domain name part
''', re.VERBOSE)
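To make the problem concrete, here is a minimal sketch that runs that regex on the pasted text from above. The variable names `pasted` and `matches` are only for illustration and are not part of the exercise.

```python
import re

# Same pattern as in the question.
email_regex = re.compile(r'''
    [a-zA-Z0-9_.+]+             # name part
    @                           # @
    [a-zA-Z0-9_.+]+\.\w{2,3}    # domain name part
''', re.VERBOSE)

# The text exactly as it comes out of a plain copy/paste of the PDF.
pasted = (
    "Kasey Mcbridemcbrid17@gmail.com939-537-1879"
    "Long Cohencohe1696@yahoo.com905-523-5311"
    "Hunter Waltonhwalton3@hotmail.com975-675-8521"
    "Jacques Deanjacquesd@att.net515-420-4722"
    "Nicky Clevelandncleveland88@mac.com573-286-5790"
)

matches = email_regex.findall(pasted)
for match in matches:
    print(match)

# Prints:
# Mcbridemcbrid17@gmail.com
# Cohencohe1696@yahoo.com
# Waltonhwalton3@hotmail.com
# Deanjacquesd@att.net
# Clevelandncleveland88@mac.com
```

Every match comes back with the surname glued to the front of the address. The trailing phone digits are left out only because the match has to end with a dot followed by two or three word characters.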