I am trying to scrape websites for email addresses, but some addresses are not getting picked up. I believe my script only picks up addresses that are hyperlinked.
import requests
import re
from bs4 import BeautifulSoup

allLinks = [];mails=[]
url = 'https://sourceforge.net/projects/peruggia/'
response = requests.get(url)
soup=BeautifulSoup(response.text,'html.parser')

def findMails(soup):
    for name in soup.find_all():
        if(name is not None):
            emailText=name.text
            match=bool(re.match(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$',emailText))
            if('@' in emailText and match==True):
                emailText=emailText.replace(" ",'').replace('\r','')
                emailText=emailText.replace('\n','').replace('\t','')
                if(len(mails)==0)or(emailText not in mails):
                    print(emailText)
                    mails.append(emailText)

findMails(soup)
mails=set(mails)
if(len(mails)==0):
    print("NO MAILS FOUND")
The target is 'https://sourceforge.net/projects/peruggia/'. The scan should show cyberfiles.hacker@gmail.com. I'm pretty sure I need to edit this line:
for name in soup.find_all():
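A minimal sketch of what may be going on, using made-up sample strings rather than the real page. `re.match` only matches at the start of the string, and the trailing `$` requires the match to run to the end, so an element passes only when its entire text is an address (as is typical for a hyperlinked address). `re.findall` without the `$` scans the whole text instead. The helper name `extract_emails` and both sample strings are hypothetical.

```python
import re

# Same character classes as the pattern in the script, minus the trailing '$'
PATTERN = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'

def extract_emails(text):
    """Return unique addresses found anywhere in text, in order of appearance."""
    found = []
    for candidate in re.findall(PATTERN, text):
        # The last character class allows '.' and '-', so trim any
        # sentence punctuation that was swallowed at the end.
        candidate = candidate.rstrip('.-')
        if candidate not in found:
            found.append(candidate)
    return found

# Hypothetical element texts: one that is only an address, one with surrounding words
linked = 'someone@example.org'
inline = 'Contact cyberfiles.hacker@gmail.com for details.'

print(bool(re.match(PATTERN + '$', linked)))  # whole text is an address
print(bool(re.match(PATTERN + '$', inline)))  # address is buried in a sentence
print(extract_emails(inline))
```

If this is the cause, passing the page text (for example `soup.get_text()`) to a function like this might pick up the addresses that are not hyperlinked, although that is untested against the actual page.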
Any help would be appreciated!