Scrape text not contained in any element

Question

I'm scraping a very poorly written site with Beautiful Soup 4. I've got everything but the user's email address, which isn't in any containing element that distinguishes it. Any ideas how to scrape it? next_sibling of the strong element skips right over it, as I expected.

<div class="fieldset-wrapper">
 <strong>
  E-mail address:
 </strong>
 useremail@yahoo.com
 <div class="field field-name-ds-user-picture field-type-ds field-label-hidden">
  <div class="field-items">

Please insert the code you're using to scrape the HTML code. — cdonts, Mar 08 '15 at 01:29

jedwards · Accepted Answer · 2015-03-08T01:37:53.733

I'm not sure this is the best way, but you could get the parent element, then iterate over its children and look at the non-tags:

from bs4 import BeautifulSoup
import bs4

html='''
<div class="fieldset-wrapper">
 <strong>
  E-mail address:
 </strong>
 useremail@yahoo.com
 <div class="field field-name-ds-user-picture field-type-ds field-label-hidden">
  <div class="field-items">
'''


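def print_if_email(s):
    # Crude heuristic: treat any string containing '@' as an email address
    if '@' in s:
        print(s)

# Name the parser explicitly so bs4 doesn't have to guess one
soup = BeautifulSoup(html, 'html.parser')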

# Iterate over all divs; you could narrow this down if you had more information
for div in soup.find_all('div'):
    # Iterate over the children of each matching div
    for c in div.children:
        # If it wasn't parsed as a tag, it may be a NavigableString
        if isinstance(c, bs4.element.NavigableString):
            # Some heuristic to identify email addresses if other non-tags exist
            print_if_email(c.strip())

Prints:

useremail@yahoo.com

Of course, the inner for loop and if statement could be combined into:

for c in filter(lambda c: isinstance(c, bs4.element.NavigableString), div.children):

score 0 · Answer 2 · edited May 23 '17 at 12:20

I can't answer your question directly as I've never used Beautiful Soup (so do NOT accept this answer!) but just want to remind you if the pages are all pretty simple, an alternative option might be to write your own parser using .split()?

This is rather clumsy, but worth considering if pages are simple/predictable...

That is, if you know something about the overall layout of the page (e.g., the user's email is always the first email mentioned), you could write your own parser to find the bits before and after the '@' sign:

# html = the entire document as a string

# return the entire document up to the '@' sign
bit_before_at_sign = html.split('@')[0]
# only useful if you know first email is the one you care about

# you could then cut out everything before username with something like this
b = bit_before_at_sign
# a very long string, we just want the last bit right before the @ sign
username = b.split(' ')[-1].split('\n')[-1].split('\r')[-1].split(';')[-1]
# add more if required, depending on how the html looks to you 
# (I've just guessed some html elements that might precede the username)

# you could similarly parse the bit after the @ sign, 
# html.split('@')[1]  
# e.g., checking the first few characters of this
# against a known list of .tlds like '.com', '.co.uk', etc  
# (remember some TLDs have more than one period, so don't just parse by '.')
# and combine with the username you already know
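
For what it's worth, the split-based idea above might be sketched roughly as follows. The function name first_email_by_split, the delimiter set, and the sample string are my own assumptions for illustration, and it only ever looks at the first '@' in the document.

```python
import re

# Characters assumed to separate an address from surrounding markup/text
DELIMITERS = r'[\s<>;:,]'

def first_email_by_split(html):
    """Return the token around the first '@' in html, or None if nothing usable is found."""
    if '@' not in html:
        return None
    # Split once, so everything after the first '@' stays in one piece
    before, after = html.split('@', 1)
    # Last run of non-delimiter characters before the '@'
    username = re.split(DELIMITERS, before)[-1]
    # First run of non-delimiter characters after the '@'
    domain = re.split(DELIMITERS, after)[0]
    if not username or not domain:
        return None
    return username + '@' + domain

sample = '''
<div class="fieldset-wrapper">
 <strong>
  E-mail address:
 </strong>
 useremail@yahoo.com
 <div class="field field-name-ds-user-picture field-type-ds field-label-hidden">
'''

print(first_email_by_split(sample))  # prints: useremail@yahoo.com
```

This still inherits the caveat above: if some other address appears earlier in the page, that is the one you get.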

Also at your disposal, in case you want to narrow down which bit of the document you focus on:

In case you want to make sure the word 'e-mail' is also in the string you're parsing

if 'email' in b.lower() or 'e-mail' in b.lower():
    # do something...

To check where in the document the @ symbol first appears

html.index('@')
# e.g., if you want to see how near this '@' symbol is to some other element you know about 
# such as the word 'e-mail', or a particular div element or '</strong>'

To confine your search for an email to the 300 characters before/after another element you know about:

startfrom = html.index('</strong>')
html_i_will_search = html[startfrom:startfrom+300]
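
A minimal sketch of confining the search to a window after a known marker. The helper name email_near and the loose pattern are assumptions on my part. It uses str.find() rather than index() so that a missing marker gives None instead of raising ValueError.

```python
import re

# Deliberately loose pattern: something@something.tld (a heuristic, not a validator)
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

def email_near(html, marker='</strong>', window=300):
    """Return the first email-like string within `window` characters after `marker`."""
    start = html.find(marker)
    if start == -1:
        return None  # marker not present in this page
    chunk = html[start:start + window]
    match = EMAIL_RE.search(chunk)
    return match.group(0) if match else None

page = ('Contact support@site.com for help. '
        '<strong>E-mail address:</strong> useremail@yahoo.com <div>...</div>')

print(email_near(page))  # prints: useremail@yahoo.com
```

Because the search starts at the marker, the earlier support@site.com in the sample string is ignored.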

I imagine a few minutes more on Google may alternatively prove useful; your task doesn't sound unusual :)

And make sure you consider cases where there are multiple email addresses on the page (e.g., so you don't assign support@site.com to every user!)

Whatever method you go with, if you have doubts, it might be worth checking your result with email.utils.parseaddr() or someone else's regex checker. See previous question
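
A rough sanity check along those lines. parseaddr() is lenient and doesn't validate much on its own, so this sketch (the name looks_like_email is hypothetical) also insists on an '@' and a dot in the domain. Treat it as a heuristic only.

```python
from email.utils import parseaddr

def looks_like_email(candidate):
    """Heuristic check: parseaddr() must yield an address shaped like local@domain.tld."""
    _name, addr = parseaddr(candidate)
    # partition() returns (local, '@', domain), or (addr, '', '') if there is no '@'
    local, sep, domain = addr.partition('@')
    return bool(local) and sep == '@' and '.' in domain

print(looks_like_email('useremail@yahoo.com'))  # prints: True
print(looks_like_email('not an email'))         # prints: False
```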

I don't understand why, but suggesting parsing html with regex is the quickest way to get the Stack Overflow hordes to attack your home with mechanized bumblebees. Beware! — thumbtackthief, Mar 08 '15 at 16:00
