
I was reading: BeautifulSoup Grab Visible Webpage Text. The problem with the accepted solution is that it might return some hidden text as visible.

An example of a non-hidden check box using XPath:

CHECK_BOX_XPATH = ("//input[@type='checkbox'"
                   " and not(@style='display: none;') and not(@visibility='hidden')"
                   " and not(@hidden) and not(@disabled)"
                   " and not(contains(@class, 'disabled'))]")

Can any of these cases be detected by BeautifulSoup and excluded from the visible text?

Please note, the HTML source I'm using BeautifulSoup on is complete, i.e. it contains all the HTML and CSS attributes, which means hidden text should be easy to detect and ignore, since it amounts to string parsing.

Can BeautifulSoup detect any of the hidden-item cases shown above, or all of them?

The accepted answer:

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))
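The `tag_visible()` filter above only inspects the parent tag's name, so inline-hidden text slips through. A sketch of how it could be extended to walk the ancestors and skip text under `hidden` or inline `display: none`/`visibility: hidden` elements (attribute checks only; class- or stylesheet-based hiding is out of reach of this approach):

```python
import re
from bs4 import BeautifulSoup
from bs4.element import Comment

# Illustrative pattern; hiding done via classes or external
# stylesheets cannot be detected from attributes alone.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden")

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    # Walk up the ancestors: text under a hidden element is hidden too.
    for parent in element.parents:
        if parent.has_attr("hidden"):
            return False
        if HIDDEN_STYLE.search(parent.get("style") or ""):
            return False
    return True

html = ('<div>shown <span style="display: none;">hidden</span>'
        '<p hidden>also hidden</p></div>')
soup = BeautifulSoup(html, "html.parser")
texts = [t for t in soup.find_all(string=True) if tag_visible(t)]
print(" ".join(t.strip() for t in texts))  # → shown
```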
Ariel
  • is my question clear enough? – Ariel Sep 13 '22 at 06:48
  • You could add some details to clarify, for example a focused extract of HTML that contains both visible and non-visible elements or texts. The URL has a paywall, so the example may not work as expected. – HedgeHog Sep 13 '22 at 07:36

1 Answer


To get all the human-readable text of the HTML <body> you can use .get_text(); to get rid of redundant whitespace, set the strip parameter, and join/separate all parts by a single whitespace:

import bs4, requests

response = requests.get('https://www.nytimes.com/',headers={'User-Agent': 'Mozilla/5.0'})
soup = bs4.BeautifulSoup(response.text,'lxml')

soup.body.get_text(' ', strip=True)
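If hidden elements should also be excluded, one option (a sketch, not part of the original answer) is to decompose() elements carrying inline `display: none` or the `hidden` attribute before calling get_text():

```python
import bs4

# Made-up HTML for illustration.
html = '<body><p>visible</p><div style="display: none;">secret</div></body>'
soup = bs4.BeautifulSoup(html, "html.parser")

# Remove inline-hidden elements before extracting text (attribute-based
# only; class- or stylesheet-based hiding would need extra handling).
for tag in soup.select('[style*="display: none"], [hidden]'):
    tag.decompose()

print(soup.body.get_text(" ", strip=True))  # → visible
```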

Note:

From the docs: As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are generally not considered to be ‘text’, since those tags are not part of the human-visible content of the page.

In newer code avoid the old syntax findAll(); instead use find_all() or select() with CSS selectors. For more, take a minute to check the docs.
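For illustration, find_all() and a select() CSS selector side by side (the sample HTML is made up; select() is backed by the soupsieve library, which supports attribute and :not() selectors):

```python
from bs4 import BeautifulSoup

html = '<input type="checkbox" name="a"><input type="checkbox" name="b" disabled>'
soup = BeautifulSoup(html, "html.parser")

# find_all() keyword-argument style: every checkbox.
boxes = soup.find_all("input", type="checkbox")

# Equivalent CSS-selector style; :not([disabled]) also drops disabled boxes.
enabled = soup.select('input[type="checkbox"]:not([disabled])')
print([t["name"] for t in enabled])  # → ['a']
```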

HedgeHog
  • Hi, can't I add on top of the code provided above without using `soup.body.get_text`? Why didn't the original answer use get_text, instead of doing heavier work with filter etc.? – Ariel Sep 13 '22 at 10:12
  • Not sure what *without using soup.body.get_text* means. Concerning the second question: check the date of the other post; there are newer versions of `beautifulsoup` available. – HedgeHog Sep 13 '22 at 11:18