I'm trying to scrape email addresses from a page and am having trouble getting the parent element that contains the email's '@' symbol. The emails are embedded within different element tags, so I can't just pick them out by tag. There are about 50,000 pages that I have to go through.
url = 'https://sec.report/Document/0001078782-20-000134/#f10k123119_ex10z22.htm'
Here are some examples (a couple are from different pages I have to scrape):
<div style="border-bottom:1px solid #000000">dbrenner@umich.edu</div>
<div class="f3c-8"><u>Bob@LifeSciAdvisors.com</u></div>
<p style="margin-bottom:0pt;margin-top:0pt;;text-indent:0pt;;font-family:Arial;font-size:11pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">Email: dmoskowitz@biocept.com; Phone: 858-320-8244</p>
<td class="f8c-43">E-mail: <u>jcohen@2020gene.com</u></td>
<p class="f7c-4">Email: jcohen@2020gene.com</p>
What I have tried:
- I tried find_all('div') to get the ResultSet of all the divs and then keep the ones that have an '@' symbol in their text.
div = page.find_all('div')
for each in div:
    if '@' in each.text:
        print(each.text)
When I did this, it printed the whole page, because the body itself is wrapped in a 'div' whose text includes every email. Since the emails sit inside different tags, checking tag by tag like this also seems inefficient.
- Using a regular expression. I tried using a regular expression to pick out the emails, but it returns a lot of unusable text that I would have to manually split, strip characters from, etc. Handling all the different cases this way seemed like a daunting task.
import re
emails = re.findall(r'\S+@\S+', str(page))
for each in emails:
    print(each)
Doing this gave me something like this:
hidden;}@media
#000000">dbrenner@umich.edu</div>
#000000">kherman@umich.edu
#000000">spage@fredhutch.org</div>
#000000">mtuck@umich.edu</div>
#000000">jdahlgre@fredhutch.org</div></p>
#000000">lafky.jacqueline@mayo.edu</div></p>
mtuck@umich.edu)</div>
#000000">ctsucontact@westat.com</div>.
href="http://@umich.edu">@umich.edu</a></li><li><a
Now I could go in and split some of the text using .split('<') and then split again, etc., but the matches are not all alike, and since I have to scrape 50,000+ pages with about 100 entries on each page, there are a lot of cases to take into consideration.
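To illustrate why `\S+@\S+` drags in markup, here is a minimal sketch of a narrower pattern run over a few of the output lines above. It assumes a simplified email shape (it is not a full RFC 5322 matcher), and the names `EMAIL_RE` and `extract_emails` are hypothetical, made up for this example.

```python
import re

# Hypothetical, simplified pattern: local part, '@', domain, dot, TLD.
# It will miss unusual but valid addresses; treat it as a starting point.
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')

def extract_emails(text):
    """Return unique email-like strings in order of first appearance."""
    seen = []
    for match in EMAIL_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen

# A few of the noisy lines produced by the \S+@\S+ attempt.
sample = '''hidden;}@media
#000000">dbrenner@umich.edu</div>
mtuck@umich.edu)</div>
href="http://@umich.edu">@umich.edu</a></li><li><a'''

print(extract_emails(sample))
```

Because the character classes exclude `<`, `>`, `"` and `)`, the surrounding markup is left out of each match, and fragments such as `@media` or `@umich.edu` with no local part are skipped.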
I searched Google and Stack Overflow, but all I can find are solutions where people are looking for the text within a known element.
What I specifically need is how to find the parent element that contains an email address.
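To make the goal concrete, here is a rough sketch of the kind of lookup I have in mind, assuming BeautifulSoup 4 and its `string=` filter on `find_all`, which accepts a compiled regular expression and matches individual text nodes. The pattern `EMAIL_RE` and the helper `find_email_parents` are hypothetical names, and the HTML is a cut-down version of the examples above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified email pattern (not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')

def find_email_parents(soup):
    """Return (email, parent tag name) pairs for text nodes holding an email."""
    results = []
    # string=<compiled regex> selects text nodes, not tags, so an outer
    # wrapper div is not matched just because a descendant has an email.
    for text_node in soup.find_all(string=EMAIL_RE):
        for email in EMAIL_RE.findall(text_node):
            results.append((email, text_node.parent.name))
    return results

html = '''<div><div style="border-bottom:1px solid #000000">dbrenner@umich.edu</div>
<td class="f8c-43">E-mail: <u>jcohen@2020gene.com</u></td>
<p class="f7c-4">Email: jcohen@2020gene.com</p></div>'''

soup = BeautifulSoup(html, 'html.parser')
pairs = find_email_parents(soup)
for email, tag_name in pairs:
    print(tag_name, email)
```

Each match is reported against the immediate parent of its text node, so the address wrapped in `<u>` comes back with `u` rather than `td`; climbing further with `.parent` would be needed if the enclosing cell or paragraph is what matters.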
I don't think I need Selenium for this, since the issue would be the same as with BeautifulSoup and the site is not JavaScript-rendered. Some of the pages are PDFs, but that is a whole other issue.
Any insight, help or advice is appreciated. Thanks.