I was reading "BeautifulSoup Grab Visible Webpage Text". The problem with the accepted solution is that it might return some hidden text as if it were visible.
Here is an example of an XPath that selects only non-hidden checkboxes:
CHECK_BOX_XPATH = "//input[(@type='checkbox')" \
" and(not(@style='display: none;')) and(not(@visibility='hidden')) and (not(@hidden)) and" \
" (not(@disabled)) and (not(contains(@class,'disabled')))]"
Can BeautifulSoup detect any of these ways of hiding an element, so that hidden text is not returned as visible text?
Please note that the HTML source I'm running BeautifulSoup on is complete, i.e. it contains all the HTML and CSS attributes, so detecting hidden text and ignoring it should be straightforward, much like string parsing.
Can BeautifulSoup detect all of the hidden-element cases I showed above, or only some of them?
The accepted answer:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)


html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
print(text_from_html(html))
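
To make the question concrete, here is a rough sketch of what I have in mind: the same filter, extended to skip text whose ancestors carry an obvious hidden marker. The helper `is_hidden`, the `HIDDEN_STYLE` pattern, and the `NON_VISIBLE_PARENTS` set are my own hypothetical names, not part of BeautifulSoup. It assumes hiding is done through the `hidden` attribute or an inline `style`, and it does not look at external stylesheets, CSS classes, or the `disabled` checks from the XPath above.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Comment

# Parents whose text is never rendered (same list as the accepted answer).
NON_VISIBLE_PARENTS = {'style', 'script', 'head', 'title', 'meta', '[document]'}

# Assumption: hiding is expressed in an inline style attribute.
# Whitespace around ':' is tolerated, unlike an exact string match.
HIDDEN_STYLE = re.compile(r'display\s*:\s*none|visibility\s*:\s*hidden', re.I)


def is_hidden(tag):
    """Hypothetical helper: True if the tag itself has a hidden marker."""
    if tag.has_attr('hidden'):
        return True
    if tag.name == 'input' and tag.get('type') == 'hidden':
        return True
    return bool(HIDDEN_STYLE.search(tag.get('style') or ''))


def tag_visible(element):
    if isinstance(element, Comment):
        return False
    if element.parent.name in NON_VISIBLE_PARENTS:
        return False
    # A hidden ancestor hides everything below it, so walk up the tree.
    return not any(is_hidden(parent) for parent in element.parents)


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    stripped = (t.strip() for t in texts if tag_visible(t))
    # Drop whitespace-only strings so the output has no repeated spaces.
    return ' '.join(t for t in stripped if t)
```

Calling `text_from_html` on a page where a `<div style="display: none;">` wraps some text should then leave that text out, while ordinary paragraphs are kept. Note that treating `visibility: hidden` on an ancestor as hiding all descendants is a simplification, since a child can override it in real CSS.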