
I am using an email-validation regex in a Google scraper to grab email addresses. The problem is that several emails are not matched cleanly because they start with `http://`. I am not great at writing regexes, and this one is already very long. Here is the code I have so far:

emailregex = r'''(?:[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-zA-Z0-9-]*[a-zA-Z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\]|[(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?])'''

driver.get("https://www.google.com")
search = driver.find_element(By.XPATH, "//input[@name='q']")
search.send_keys(searchterm)

submit = driver.find_element(By.XPATH, "//input[@type='submit']")
driver.execute_script("arguments[0].click();", submit)
doc = driver.page_source

email_list = []

for re_match in re.finditer(emailregex, doc):
    email_list.append(re_match.group())

while True:
    try:
        next_page = driver.find_element(By.ID, "pnnext")
        driver.execute_script("arguments[0].click();", next_page)
        doc = driver.page_source
        for re_match in re.finditer(emailregex, doc):
            email_list.append(re_match.group())
    except Exception:  # stop when the "Next" link is missing (or any other error occurs)
        break

for i, email in enumerate(email_list):
    print(f'{i + 1}: {email}')
  • In what scenario does an email address have an `http://` prefix? – jarmod Jul 26 '22 at 13:56
  • That doesn't sound like a valid email. Surely you can't send an email to `http://some.email+address@somecompany.com`, right? This is also one of the reasons why regex just isn't the best for email address validation. There's ALWAYS an edge case. – JNevill Jul 26 '22 at 13:57
  • I believe in these cases they have a Gmail account and put `http://` in front of the address to jump to it. – Steven Mullikin Jul 26 '22 at 14:02
  • Interesting. It looks like Google does redirect you to the inbox via 307 internal redirect to `https://addr@gmail.com` and then 301 redirect to `https://www.google.com/gmail`, then 302 redirect to `https://mail.google.com/mail/`, then 302 redirect to the inbox at `https://mail.google.com/mail/u/0/`. – jarmod Jul 26 '22 at 14:11
  • [Oh interesting](https://stackoverflow.com/a/19511469/16450169) – 0x263A Jul 26 '22 at 14:46

1 Answer


You can add `(http(s)?:\/\/)?` at the start of the regex to optionally match a leading `http://` or `https://`.

Link to check the new regex: https://regex101.com/r/LPG1NW/1
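
As a minimal sketch of the same idea in Python, the snippet below prepends an optional scheme group to an email pattern and reads the address from a named group, so the prefix is matched but not returned. `SIMPLE_EMAIL` is a deliberately simplified stand-in for the long regex in the question (an assumption made for readability), and `find_emails` is a hypothetical helper name.

```python
import re

# Simplified stand-in for the long pattern in the question (assumption:
# shortened for readability; it is not RFC 5322 complete).
SIMPLE_EMAIL = r"[a-zA-Z0-9._%+-]+@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}"

# Optional http:// or https:// prefix; the named group holds only the address.
PREFIXED_EMAIL = re.compile(r"(?:https?://)?(?P<email>" + SIMPLE_EMAIL + r")")


def find_emails(text):
    """Return addresses found in text, without any http(s):// prefix."""
    return [m.group("email") for m in PREFIXED_EMAIL.finditer(text)]


sample = "Contact http://jane.doe@gmail.com or bob@example.org for details."
print(find_emails(sample))  # ['jane.doe@gmail.com', 'bob@example.org']
```

Calling `m.group()` with no argument, as the loop in the question does, would return the whole match including the prefix; the named group is one way to keep only the address.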

FedeG
  • This worked out very well after I changed it to Python and escaped the two " characters. However, in the search results the email domain sometimes has tags around it (e.g. test@test.com with the domain wrapped in tags). Would you be so kind as to show me how to add that to the regex? – Steven Mullikin Jul 26 '22 at 15:17
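
One possible approach to the follow-up, rather than extending the regex, is to strip markup from the page source before matching. The sketch below assumes the tags are ordinary inline HTML elements such as `<em>` (the comment does not say which tags are involved), uses a simplified email pattern in place of the long one, and `strip_tags` is a hypothetical helper name.

```python
import re

# Matches any HTML-like tag, e.g. <em> or </b> (assumed tag shape).
TAG = re.compile(r"<[^>]+>")

# Simplified stand-in for the long email pattern in the question.
SIMPLE_EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}")


def strip_tags(html):
    """Remove tags so an address split by markup becomes contiguous text."""
    return TAG.sub("", html)


doc = "Write to test@<em>test.com</em> today"
print(SIMPLE_EMAIL.findall(strip_tags(doc)))  # ['test@test.com']
```

Removing tags outright can also glue an address to neighbouring text (for example across table cells), so replacing only known inline tags, or parsing the page with an HTML parser, may be safer on real result pages.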