
I'm trying to scrape email addresses from a page and having trouble getting the parent element that contains the email '@' symbol. The emails are embedded within different element tags, so I can't just pick them out. There are about 50,000 pages I have to go through.

url = 'https://sec.report/Document/0001078782-20-000134/#f10k123119_ex10z22.htm'

Here are some examples (couple are from different pages I have to scrape):

<div style="border-bottom:1px solid #000000">dbrenner@umich.edu</div>

<div class="f3c-8"><u>Bob@LifeSciAdvisors.com</u></div>

<p style="margin-bottom:0pt;margin-top:0pt;;text-indent:0pt;;font-family:Arial;font-size:11pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">Email: dmoskowitz@biocept.com; Phone: 858-320-8244</p>

<td class="f8c-43">E-mail: <u>jcohen@2020gene.com</u></td>

<p class="f7c-4">Email: jcohen@2020gene.com</p>

What I have tried:

  1. I tried find_all('div') to get the ResultSet of all the divs, then filtered for the ones that have an '@' symbol in their text:
div = page.find_all('div')
for each in div:
    if '@' in each.text: 
        print(each.text)

When I did this, because the page body itself sits inside a 'div', it printed the whole page. Fail. And since the emails are embedded within different tags, this method seems inefficient anyway.
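One variation I've been sketching (not tested at scale): check only each div's direct text, not text inherited from nested children, so the page-level wrapper divs stop matching:

```python
from bs4 import BeautifulSoup

html = '<div><div>dbrenner@umich.edu</div><p>Phone: 555-0100</p></div>'
soup = BeautifulSoup(html, "html.parser")

for div in soup.find_all('div'):
    # Only strings that are direct children of this div -- a wrapper div
    # whose '@' comes from a descendant no longer matches.
    direct_text = ''.join(div.find_all(string=True, recursive=False))
    if '@' in direct_text:
        print(div.text)
```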

  2. Using regular expressions. I tried using a regular expression to pick out the emails, but it captures a bunch of surrounding text that isn't usable, which I would have to manually split up, replace characters in, etc. Covering all the different scenarios just seemed a daunting task.
    import re
    emails = re.findall(r'\S+@\S+', str(page))
    for each in emails:
        print(each)

Doing this gave me something like this:

hidden;}@media
#000000">dbrenner@umich.edu</div>
#000000">kherman@umich.edu
#000000">spage@fredhutch.org</div>
#000000">mtuck@umich.edu</div>
#000000">jdahlgre@fredhutch.org</div></p>
#000000">lafky.jacqueline@mayo.edu</div></p>
mtuck@umich.edu)</div>
#000000">ctsucontact@westat.com</div>.
href="http://@umich.edu">@umich.edu</a></li><li><a

Now I could go in and split some of the strings using .split('<'), then split again, etc., but they're not all the same, and since I have to scrape 50,000+ pages with about 100 entries each, there are too many variations to handle case by case.
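A related idea I've been toying with: run the regex over the extracted text instead of str(page), and tighten the pattern, so the CSS and markup never make it into the matches (the pattern below is a rough sketch, not a full RFC-style email matcher):

```python
import re
from bs4 import BeautifulSoup

html = ('<div style="border-bottom:1px solid #000000">dbrenner@umich.edu</div>'
        '<p>Email: jcohen@2020gene.com; Phone: 858-320-8244</p>')
soup = BeautifulSoup(html, "html.parser")

# Rough email pattern: word chars/dots/+/- before the '@',
# then a domain with at least one dot.
EMAIL = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

# get_text() strips all tags and attributes, so '#000000">' noise
# can't appear in the matches.
emails = EMAIL.findall(soup.get_text(' '))
print(emails)
```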

I tried looking on Google and Stack Overflow, but all I can find are solutions where people are looking for the text within a certain element, etc.

What I need is specifically 'How to find the parent element that contains an email'.

I don't think I need Selenium for this, since the issue would be the same as with BeautifulSoup and the site is not JavaScript-rendered (other than some of the pages being PDFs, which is a whole other issue).

Any insight, help or advice is appreciated. Thanks.

uclaastro

1 Answer


There are two options to search for text that contains an @ symbol:

  1. Use the CSS selector :contains(<MY TEXT>) to match elements whose text contains an @ symbol.

  2. Use a lambda function in the find_all() method, and check whether @ is in the tag's .text.

Option 1:

from bs4 import BeautifulSoup


html = """<div style="border-bottom:1px solid #000000">dbrenner@umich.edu</div>

<div class="f3c-8"><u>Bob@LifeSciAdvisors.com</u></div>

<p style="margin-bottom:0pt;margin-top:0pt;;text-indent:0pt;;font-family:Arial;font-size:11pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">Email: dmoskowitz@biocept.com; Phone: 858-320-8244</p>

<td class="f8c-43">E-mail: <u>jcohen@2020gene.com</u></td>

<p class="f7c-4">Email: jcohen@2020gene.com</p>"""

soup = BeautifulSoup(html, "html.parser")

for tag in soup.select('*:contains("@")'):
    print(tag.text.strip())

Option 2:

for tag in soup.find_all(lambda t: "@" in t.text.strip()):
    print(tag.text.strip())
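Note that both options will also match every ancestor of an email (up to the root element), because an ancestor's .text includes its descendants' text, which is how you can end up printing the whole page. One possible refinement (my own addition, not something the question requires) is to keep only the innermost matching tags:

```python
from bs4 import BeautifulSoup

html = """<div style="border-bottom:1px solid #000000">dbrenner@umich.edu</div>
<td class="f8c-43">E-mail: <u>jcohen@2020gene.com</u></td>"""
soup = BeautifulSoup(html, "html.parser")

def innermost_with_at(tag):
    # Match a tag whose text contains '@' but none of whose descendant
    # tags do -- i.e. the immediate parent element of the email.
    if '@' not in tag.get_text():
        return False
    return not any('@' in child.get_text() for child in tag.find_all(True))

for tag in soup.find_all(innermost_with_at):
    print(tag.name, tag.get_text(strip=True))
```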
MendelG
  • I tried your method on the website but I'm getting an IOPub data rate exceeded warning: IOPub data rate exceeded. The notebook server will temporarily stop sending output to the client in order to avoid crashing it. To change this limit, set the config variable `--NotebookApp.iopub_data_rate_limit`. Current values: NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec) NotebookApp.rate_limit_window=3.0 (secs) – uclaastro Nov 29 '20 at 03:36
  • @uclaastro This is a problem with your Jupyter. [see this SO post to fix it](https://stackoverflow.com/questions/43288550/iopub-data-rate-exceeded-in-jupyter-notebook-when-viewing-image) – MendelG Nov 29 '20 at 03:39
  • @MendelG So I did that and the script does run, but now the page becomes non-responsive due to the amount of text being returned. Since '*:contains("@")' also covers the initial 'div', it returns the whole page and gives me a timing-out prompt – uclaastro Nov 29 '20 at 05:06