I'm doing an exercise in which I have to create a program that takes the contents of the clipboard as input, parses it, and returns a list (in the non-Python sense) of the email addresses contained within.

The source file for said input is a sample public domain PDF that has the following layout:

Input text layout


It looks simple enough, except that when I copy/paste that input normally (i.e. without using my program), I get the following output:

Kasey Mcbridemcbrid17@gmail.com939-537-1879Long Cohencohe1696@yahoo.com905-523-5311Hunter Waltonhwalton3@hotmail.com975-675-8521Jacques Deanjacquesd@att.net515-420-4722Nicky Clevelandncleveland88@mac.com573-286-5790

You see where the problem lies: the surname is stuck to the beginning of the email address, and thus my program wouldn't be able to parse the addresses correctly.

Would there be a way, regex or otherwise, to somehow separate these during parsing, or is there nothing for it short of doing it by hand or reformatting the file?

So far, my regex looks like this:

import re

email_regex = re.compile(r'''
    [a-zA-Z0-9_.+]+             # name part
    @                           # @
    [a-zA-Z0-9_.+]+\.\w{2,3}    # domain name part
''', re.VERBOSE)
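To make the problem concrete, here is a small sketch that runs the regex above against the pasted text, shortened to the first two records. The variable names `pasted` and `matches` are only illustrative.

```python
import re

email_regex = re.compile(r'''
    [a-zA-Z0-9_.+]+             # name part
    @                           # @
    [a-zA-Z0-9_.+]+\.\w{2,3}    # domain name part
''', re.VERBOSE)

# First two records of the clipboard text shown above.
pasted = ('Kasey Mcbridemcbrid17@gmail.com939-537-1879'
          'Long Cohencohe1696@yahoo.com905-523-5311')

matches = email_regex.findall(pasted)
print(matches)
# ['Mcbridemcbrid17@gmail.com', 'Cohencohe1696@yahoo.com']
```

The domain part is matched correctly, but the surname is swallowed into the name part because nothing separates the two in the pasted text.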
ledebutant
  • I am afraid regex won't help: your email usernames do not follow any generic pattern with respect to the names. – Wiktor Stribiżew Mar 01 '21 at 10:30
  • Is there a way to turn something like `"Kasey Mcbridemcbrid17@gmail.com"` into `"mcbrid17@gmail.com"` automatically? No, there isn't, forget it. But with a tool that can read the PDF *structure* instead of trying to dissect the mess that copy&paste creates, you could have more luck. For example, give [pdfminer](https://pypi.org/project/pdfminer/) a spin and see how far it gets you with your files ([see](https://stackoverflow.com/questions/26494211/extracting-text-from-a-pdf-file-using-pdfminer-in-python)). – Tomalak Mar 01 '21 at 10:31
  • You probably wanna have a look at this post: https://stackoverflow.com/questions/55139685/how-to-extract-email-from-pdf – cochaviz Mar 01 '21 at 11:26
  • Does this answer your question? [how to extract email from pdf](https://stackoverflow.com/questions/55139685/how-to-extract-email-from-pdf) – Ryszard Czech Mar 01 '21 at 21:27
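A minimal sketch of the approach suggested in the comments: read the text from the PDF itself instead of from the clipboard, then run the email regex over it. This assumes the maintained `pdfminer.six` fork (`pip install pdfminer.six`) and its `extract_text` helper; the function names `find_emails` and `emails_from_pdf` and the file name `sample.pdf` are hypothetical. Whether the columns actually come out separated depends on how the PDF was produced, so treat this as something to try, not a guaranteed fix.

```python
import re

EMAIL_RE = re.compile(r'[a-zA-Z0-9_.+]+@[a-zA-Z0-9_.+]+\.\w{2,3}')

def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)

def emails_from_pdf(path):
    """Extract text with pdfminer.six and return the addresses found in it."""
    # Imported here so find_emails() works even without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    return find_emails(extract_text(path))

# If the extractor keeps the columns apart (e.g. separated by newlines),
# the same regex that failed on the clipboard text picks out a clean address.
print(find_emails('Kasey Mcbride\nmcbrid17@gmail.com\n939-537-1879\n'))
# ['mcbrid17@gmail.com']
```

With a real file this would be called as `emails_from_pdf('sample.pdf')`.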

1 Answer

Pattern

import re

sample = 'Kasey Mcbridemcbrid17@gmail.com939-537-1879Long Cohencohe1696@yahoo.com905-523-5311Hunter Waltonhwalton3@hotmail.com975-675-8521Jacques Deanjacquesd@att.net515-420-4722Nicky Clevelandncleveland88@mac.com573-286-5790'

# Capture the part before "@", the provider, and the top-level domain separately.
pattern = r'([a-zA-Z0-9_.]+)@([a-z]+)\.([a-z]{2,5})'
result = [{"name": x, "provider": y, "domain": z} for x, y, z in re.findall(pattern, sample)]

output:

[{'name': 'Mcbridemcbrid17', 'provider': 'gmail', 'domain': 'com'},
{'name': 'Cohencohe1696', 'provider': 'yahoo', 'domain': 'com'},
{'name': 'Waltonhwalton3', 'provider': 'hotmail', 'domain': 'com'},
{'name': 'Deanjacquesd', 'provider': 'att', 'domain': 'net'},
{'name': 'Clevelandncleveland88', 'provider': 'mac', 'domain': 'com'}]
Leonardo Scotti