I want to scrape protected email addresses written with [at] and [dot], using Python 3 and BeautifulSoup 4. My code is here:

email = soup(text=re.compile(r'[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*'))

_emailtokens = str(email).replace("\\t", "").replace("\\n", "").split(' ')

if len(_emailtokens):
    print([match.group(0) for token in _emailtokens for match in [re.search(r"([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)", str(token.strip()))] if match])

Output of my code (every normal email is detected, scraped and printed):

info@abcd.com
    

I need to scrape protected emails written in the styles below:

info [at] abcd.com
info@abcd [dot] com
info [at] abcd [dot] com
And so on.

I want to match all of these styles and convert each one to a normal email address (e.g. info@abcd.com).

  • What is the current output of your code? What isn’t working? – AMC Nov 12 '19 at 08:10
  • Dear @AlexanderCécile, I have added the output to my question. – William Johnson Nov 12 '19 at 09:19
  • Which input does that output correspond to? – AMC Nov 12 '19 at 15:59
  • Dear @AlexanderCécile, this code is able to detect normal emails like info@abc.com and I need to add some protected styles (e.g. info [at] abc [dot] com) to detect as email and change to normal style after detection. – William Johnson Nov 13 '19 at 06:32
  • Is no one here to help me? – William Johnson Nov 13 '19 at 12:36
  • Can you provide an example of the HTML which contains the address? Have you tried just doing a simple string replace "[at]" -> "@" and "[dot]" -> "."? – AMC Nov 13 '19 at 23:24
  • Dear @AlexanderCécile, I need a regex like email = soup(text=re.compile(r'[A-Za-z0-9\.\+_-]+ [@[at]] [A-Za-z0-9\._-]+ [\."[dot]"] + [a-zA-Z]*')). I want it to work when we have @ or [at], and also when we have . or [dot], in every combination of these 4 cases. I don't know how to write this regex. – William Johnson Nov 14 '19 at 11:29
  • Are you running the regex on the entirety of the page’s contents? – AMC Nov 14 '19 at 14:12
  • @AlexanderCécile Yes, is there any problem with this kind of regex usage? – William Johnson Nov 15 '19 at 13:58
  • Generally I would expect BeautifulSoup to be used to find the relevant HTML elements, and then regex on the contents only. – AMC Nov 15 '19 at 15:30

1 Answer

First, the non-warranty statement: You will find on this website what purport to be regular expressions for validating email addresses (see How to validate an email address using a regular expression?). They are very complicated. Needless to say, your basic regex recognizes only a subset of valid email addresses, but we will go with that as the basis. The basic regex now becomes:

r'[a-z0-9.+-]+(@|\s*\[\s*at\s*\]\s*)[a-z0-9._-]+(\.|\s*\[\s*dot\s*\]\s*)[a-z]*'

compiled with the flag re.IGNORECASE so that, for example, at or AT are equally recognized. This regex also allows flexible spacing as you will see in the following example code:

import re

emails = """info [at] abcd.com
info@abcd [dot] com
info [at] abcd [dot] com
INFO [ AT ] ABCD[ DOT ]COM"""

regex = re.compile(r'[a-z0-9.+-]+(@|\s*\[\s*at\s*\]\s*)[a-z0-9._-]+(\.|\s*\[\s*dot\s*\]\s*)[a-z]*', flags=re.IGNORECASE)
for m in regex.finditer(emails):
    print(m.group(0))

Prints:

info [at] abcd.com
info@abcd [dot] com
info [at] abcd [dot] com
INFO [ AT ] ABCD[ DOT ]COM
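
The matches above are still in their obfuscated form, while the question asks for them in the normal style. One possible way to finish the job is sketched below: each match is passed through two substitutions that turn the bracketed tokens back into @ and a literal dot. The names EMAIL, AT, DOT and normalize_emails are made up for this illustration. The pattern is a slight variant of the one above, not the same regex: it is assumed here that _ should be allowed in the local part (as in the question's original regex) and that at least one letter must follow the final dot.

```python
import re

# Variant of the answer's pattern (assumptions: "_" allowed before the @,
# and at least one letter required after the last dot).
EMAIL = re.compile(
    r'[a-z0-9._+-]+(?:@|\s*\[\s*at\s*\]\s*)'
    r'[a-z0-9._-]+(?:\.|\s*\[\s*dot\s*\]\s*)[a-z]+',
    flags=re.IGNORECASE,
)
# The obfuscation tokens, including any whitespace around them.
AT = re.compile(r'\s*\[\s*at\s*\]\s*', flags=re.IGNORECASE)
DOT = re.compile(r'\s*\[\s*dot\s*\]\s*', flags=re.IGNORECASE)


def normalize_emails(text):
    """Return every address found in text, rewritten as user@domain.tld."""
    found = []
    for m in EMAIL.finditer(text):
        address = AT.sub('@', m.group(0))   # "[at]"  -> "@"
        address = DOT.sub('.', address)     # "[dot]" -> "."
        found.append(address)
    return found


emails = """info [at] abcd.com
info@abcd [dot] com
info [at] abcd [dot] com
INFO [ AT ] ABCD[ DOT ]COM"""

print(normalize_emails(emails))
# ['info@abcd.com', 'info@abcd.com', 'info@abcd.com', 'INFO@ABCD.COM']
```

The case of each address is left as found. If the text comes from BeautifulSoup, running this on the text of the relevant elements only, as suggested in the comments, should reduce false positives.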
Booboo