
I am trying to scrape websites for emails. I noticed that some emails are not getting picked up... I believe the script I have only picks up emails that are hyperlinked.

import requests
import re
from bs4 import BeautifulSoup

allLinks = [];mails=[]

url = 'https://sourceforge.net/projects/peruggia/'
response = requests.get(url)
soup=BeautifulSoup(response.text,'html.parser')

def findMails(soup):
    for name in soup.find_all():
        if(name is not None):
            emailText=name.text
            match=bool(re.match('[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$',emailText))
            if('@' in emailText and match==True):
                emailText=emailText.replace(" ",'').replace('\r','')
                emailText=emailText.replace('\n','').replace('\t','')
                if(len(mails)==0)or(emailText not in mails):
                    print(emailText)
                mails.append(emailText)
findMails(soup)
mails=set(mails)
if(len(mails)==0):
    print("NO MAILS FOUND")

Target is 'https://sourceforge.net/projects/peruggia/'. The scan should show cyberfiles.hacker@gmail.com. I'm pretty sure I need to edit this line:

for name in soup.find_all():

Any help would be appreciated!

Adam Richard

2 Answers


Try this:

soup.find_all('div', {'class': "review-txt"})

That seems to narrow things down, if that's what you're looking for. I only see one email address on that page, in the above div. I'm not sure if this will suit your purposes, but it's a start.

Keep in mind that you can normally say find_all('tag', attrib='something'), but class is a reserved word in Python, so you have to use either the dictionary format shown above or the class_ keyword: find_all('div', class_='review-txt').

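I also notice that your re.match() comes back as False even when there is an email address in the captured text. re.match() only matches at the start of the string, and your pattern ends in $, so it succeeds only when the element's text is nothing but the address. re.search() or re.findall(), with the $ removed, will find an address anywhere in the text.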
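To illustrate the re.match() point, here is a minimal sketch using only the standard library. The sample text and the address someone@example.com are made up for the demonstration; the pattern is the one from the question.

```python
import re

# The pattern from the question, split so the trailing '$' can be toggled.
BODY = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'
ANCHORED = BODY + '$'

# Hypothetical element text: an address surrounded by other words.
sample = "Contact me at someone@example.com for details"

# re.match only tries the very start of the string, so this is None.
anchored = re.match(ANCHORED, sample)

# Without the '$', re.findall scans the whole string for addresses.
found = re.findall(BODY, sample)

# re.match succeeds only when the text is nothing but the address.
exact = re.match(ANCHORED, "someone@example.com")

print(anchored)   # None
print(found)      # ['someone@example.com']
print(exact.group())
```

This is why the check in the question rejects the review text: the element contains the address, but it does not consist of the address alone.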

GaryMBloom

Try a different regular expression and this can be a lot simpler. I found this expression in an answer here.

Also, since you are looking for emails anywhere on the page, I just ran re.findall over all of the text inside the body tag.

import re

import requests
from bs4 import BeautifulSoup

url = 'https://sourceforge.net/projects/peruggia/'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')


def findMails(soup):
    # Collect the text of the <body> tag(s) into one string.
    data = ''
    for tag in soup('body'):
        data += tag.text.strip()

    # Raw string so the backslashes reach the regex engine unchanged.
    return re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', data)


emails = findMails(soup)
print(emails if emails else 'Emails Not found')
## Result: ['cyberfiles.hacker@gmail.com']
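If the same address appears more than once on a page, re.findall will return it each time. The question used set() to drop repeats; a possible alternative that also keeps first-seen order is sketched below. The helper name unique_emails and the sample text are hypothetical, and the regex is the one from this answer with the redundant escape inside the character class removed.

```python
import re

# Same idea as the pattern above; '.' needs no escape inside [...].
EMAIL_RE = re.compile(r'[\w.-]+@[\w.-]+\.\w+')


def unique_emails(text):
    """Return addresses found in text, de-duplicated, in first-seen order."""
    # dict.fromkeys keeps insertion order and discards repeated keys.
    return list(dict.fromkeys(EMAIL_RE.findall(text)))


# Made-up text standing in for the page body.
sample = "Mail a@example.com or b@example.org; again: a@example.com."
emails = unique_emails(sample)
print(emails)
```

In the answer's script, this would take the data string built from the body tag in place of sample.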