2

How can I get only visible text from some HTML node in Python?

Suppose that I have a node like this:

<span>
   <style>.vAnH{display:none}.vsP6{display:inline}</style>
   <span class="vAnH">34</span>
   <span />
   <span style="display: inline">111</span>
   <span style="display:none">120</span>
   <span class="vAnH">120</span>
   <div style="display:none">120</div>
   <span class="78">.</span>
   <span class="vAnH">100</span>
   <div style="display:none">100</div>
   161
   <span style="display: inline">.</span>
   <span class="174">126</span>
   <span class="vAnH">159</span>
   <div style="display:none">159</div>
   <span />
   <span class="vsP6">.</span>
   <span style="display:none">5</span>
   <span class="vAnH">5</span>
   <div style="display:none">5</div>
   <span style="display:none">73</span>
   <span class="vAnH">73</span>
   <div style="display:none">73</div>
   <span class="221">98</span>
   <span style="display:none">194</span>
   <div style="display:none">194</div>
</span>

Are there any third-party libraries that do this, or should I parse it manually?

alecxe
FrozenHeart
  • Look into [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) – BrtH Dec 26 '14 at 15:27
  • @BrtH I already use it, but I don't see any solution to get visible text only here – FrozenHeart Dec 26 '14 at 15:28
  • Ok. I think it's possible by using find_all combined with a filter [function](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function). – BrtH Dec 26 '14 at 15:31
  • No standalone HTML parser or XPath engine is going to be able to apply CSS rules for you. You might have to go with a headless browser (e.g. Selenium) to do that kind of thing. – roippi Dec 26 '14 at 15:37
  • (Optionally write logic to filter the display:nones yourself, yes) – roippi Dec 26 '14 at 15:39
  • @FrozenHeart you can do it with BeautifulSoup but you have to check for the parent elements too to get reliable results. Below is a solution that you can try, it works fine. But if it's something that has to perfectly reflect a browser display, then I'd go for PhantomJS or Selenium. – Jivan Dec 26 '14 at 16:33

3 Answers

1

There are multiple ways to make a node visible or hidden for the end user in the browser. BeautifulSoup is an HTML parser; it doesn't know whether an element would be shown or not. There was an attempt here, though:

It would not work if, for example, an element is hidden by a CSS rule, but might work for your use case.
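To illustrate the CSS-rule case, here is a minimal sketch that pulls the hidden class names out of a stylesheet string with a regular expression. The helper name `hidden_classes` is hypothetical, and the sketch assumes simple single-class selectors such as `.vAnH{display:none}`; it is not a real CSS parser and will miss combined selectors, media queries, and similar constructs.

```python
import re

def hidden_classes(css):
    """Return the class names whose rule body sets display:none.

    Hypothetical helper: only handles simple `.name { ... }` rules.
    """
    hidden = set()
    # Match `.name { declarations }` for simple class selectors.
    for name, body in re.findall(r'\.([\w-]+)\s*\{([^}]*)\}', css):
        # Drop whitespace so "display: none" and "display:none" compare equal.
        if 'display:none' in re.sub(r'\s+', '', body).lower():
            hidden.add(name)
    return hidden

css = '.vAnH{display:none}.vsP6{display:inline}'
print(hidden_classes(css))  # {'vAnH'}
```

The resulting set could then be compared against each element's class attribute when deciding what to skip.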

The easiest option would be to switch to Selenium, where the .text property returns only the visible text of an element:

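from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://domain.com')

# find_element_by_id was removed in Selenium 4; use find_element(By.ID, ...)
element = driver.find_element(By.ID, 'id_of_an_element')
print(element.text)  # .text holds only the text a user would see

driver.quit()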
alecxe
  • Is there any way to do it without opening browser? – FrozenHeart Dec 26 '14 at 15:46
  • @FrozenHeart well, a headless `PhantomJS` browser can be automated via `selenium`. Might be an option for you. – alecxe Dec 26 '14 at 15:47
  • @alecxe, when I tried your above code, it shows following error: " Traceback (most recent call last): File "/home/barpa32/anaconda3/lib/python3.8/site-packages/selenium/webdriver/common/service.py", line 72, in start self.process = subprocess.Popen(cmd, env=self.env, File "/home/barpa32/anaconda3/lib/python3.8/subprocess.py", line 854, in __init__ self._execute_child(args, executable, preexec_fn, close_fds, File "/home/barpa32/anaconda3/lib/python3.8/subprocess.py", line 1702, in _execute_child raise child_exception_type(errno_num, err_msg, err_filename)" – tursunWali Feb 16 '21 at 17:02
1

If you don't want to go the Selenium way, you can get part of the way there with BeautifulSoup:

from bs4 import BeautifulSoup

def is_visible_span_or_div(tag, is_parent=False):
    """Check whether the element is a span or a div and whether it is
    visible. If so, recursively check all of its parents and return
    False if one of them is hidden."""

    # load the style attribute, dropping whitespace so that
    # 'display: none' and 'display:none' are treated alike
    style = ''.join(tag.attrs.get('style', '').split())

    # check that the element is a div or a span, unless it is a parent
    if not is_parent and tag.name not in ('div', 'span'):
        return False

    # check whether the element itself is hidden
    if 'hidden' in style or 'display:none' in style:
        return False

    # make a recursive call to check the parent as well
    parent = tag.parent
    if parent and not is_visible_span_or_div(parent, is_parent=True):
        return False

    # neither the element nor its parent(s) are hidden, so return True
    return True

html = """
    <span style="display: none;">I am not visible</span>
    <span style="display: inline">I am visible</span>
    <div style="display: none;">
        <span>I am a visible span inside a hidden div</span>
    </div>
"""

# name the parser explicitly so the result doesn't depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

visible_elements = soup.find_all(is_visible_span_or_div)

print(visible_elements)

Keep in mind that this won't exactly reflect the way a browser would display or hide the elements, because other factors can determine the visibility of an element (such as width, height, opacity, absolute positioning outside the window...). It also looks only at inline style attributes, not at rules defined in a stylesheet.

Despite that, for inline styles the check is quite reliable, because it recursively checks all of the element's parents as well and returns False as soon as it finds a hidden parent.

The only problem I see with this function is its overhead: it walks up through all the parents for every element, even when those elements are siblings in the DOM tree and share the same parents. It could easily be optimised for that, but perhaps at the cost of readability.
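One possible way to avoid re-checking the parents is to walk the tree once from the top and simply not descend into a hidden element, so each node is visited at most once. The following is a rough sketch under that assumption, not a tuned implementation; `is_hidden` and `visible_text` are illustrative names, and only inline `display:none` and `visibility:hidden` styles are considered.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

def is_hidden(tag):
    """Inline-style check only; rules in <style> blocks are ignored."""
    style = ''.join(tag.get('style', '').split()).lower()
    return 'display:none' in style or 'visibility:hidden' in style

def visible_text(node):
    """Collect text top-down, skipping hidden subtrees entirely."""
    parts = []
    for child in node.children:
        if isinstance(child, Tag):
            if child.name in ('style', 'script') or is_hidden(child):
                continue  # nothing below a hidden element is visited
            parts.extend(visible_text(child))
        elif isinstance(child, NavigableString) and not isinstance(child, Comment):
            text = child.strip()
            if text:
                parts.append(text)
    return parts

html = """
    <span style="display: none;">I am not visible</span>
    <span style="display: inline">I am visible</span>
    <div style="display: none;">
        <span>I am a visible span inside a hidden div</span>
    </div>
"""

soup = BeautifulSoup(html, 'html.parser')
print(visible_text(soup))  # ['I am visible']
```

Unlike the filter function, this returns text rather than elements, and it also picks up text that sits directly inside a visible parent without a span or div of its own.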

Jivan
  • @ Jivan, I tried, your code works, Thank you. I have modified last 3 lines of your code and it prints only the text in the visible element (I mean without HTML markups): "#soup = BeautifulSoup(html, 'features="lxml') soup = BeautifulSoup(html, 'html.parser') visible_elements = soup.find_all(is_visible_span_or_div) for teks in visible_elements: teks=teks.text print(teks) " – tursunWali Feb 16 '21 at 17:09
  • However, it did not work correctly for this web page (it runs, but shows only some of the visible text, namely the text below the main text): " https://saugeentimes.com/new-perspectives-a-senior-moment-food-glorious-food-part-2/" – tursunWali Feb 16 '21 at 17:26
0

You'll need to write a custom filter function. A working example:

from bs4 import BeautifulSoup
import re

data = '''<span>
   <style>.vAnH{display:none}.vsP6{display:inline}</style>
   <span class="vAnH">34</span>
   <span />
   <span style="display: inline">111</span>
   <span style="display:none">120</span>
   <span class="vAnH">120</span>
   <div style="display:none">120</div>
   <span class="78">.</span>
   <span class="vAnH">100</span>
   <div style="display:none">100</div>
   161
   <span style="display: inline">.</span>
   <span class="174">126</span>
   <span class="vAnH">159</span>
   <div style="display:none">159</div>
   <span />
   <span class="vsP6">.</span>
   <span style="display:none">5</span>
   <span class="vAnH">5</span>
   <div style="display:none">5</div>
   <span style="display:none">73</span>
   <span class="vAnH">73</span>
   <div style="display:none">73</div>
   <span class="221">98</span>
   <span style="display:none">194</span>
   <div style="display:none">194</div>
</span>'''

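# name the parser explicitly so the result doesn't depend on what is installed
soup = BeautifulSoup(data, 'html.parser')
# class name that the <style> block hides ('vAnH' for this markup)
no_disp = re.search(r'\.([\w-]+)\{display:none\}', soup.style.string).group(1)

def find_visible(tag):
    # drop whitespace so 'display: none' is caught as well as 'display:none'
    style = ''.join(tag.get('style', '').split())
    return (tag.name != 'style'
            and no_disp not in tag.get('class', [])
            and 'display:none' not in style)

# string=True (text=True in older bs4) keeps only tags whose sole child is text
for tag in soup.find_all(find_visible, string=True):
    print(tag.string)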
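The filter above reports tags whose only child is a string, so text that sits directly inside the outer span (the bare 161) appears to be skipped. As a rough sketch of a text-node-level alternative, the example below uses only the standard library's `html.parser` instead of BeautifulSoup (an assumption made to keep it dependency-free). The names `VisibleText` and `extract_visible` are hypothetical, the hidden class name is read off the `<style>` block by hand, and the markup is a shortened copy of the question's node.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not be pushed on the stack.
VOID = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
        'link', 'meta', 'source', 'track', 'wbr'}

class VisibleText(HTMLParser):
    def __init__(self, hidden_classes=()):
        super().__init__()
        self.hidden_classes = set(hidden_classes)
        self.stack = []   # one flag per open element: is it, or an ancestor, hidden?
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        style = ''.join((attrs.get('style') or '').split()).lower()
        classes = (attrs.get('class') or '').split()
        hidden = (
            (self.stack and self.stack[-1])       # inherited from a hidden parent
            or tag in ('style', 'script')
            or 'display:none' in style
            or any(c in self.hidden_classes for c in classes)
        )
        self.stack.append(bool(hidden))

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and not (self.stack and self.stack[-1]):
            self.parts.append(text)

def extract_visible(html, hidden_classes=()):
    parser = VisibleText(hidden_classes)
    parser.feed(html)
    parser.close()
    return parser.parts

sample = '''<span>
   <style>.vAnH{display:none}.vsP6{display:inline}</style>
   <span class="vAnH">34</span>
   <span />
   <span style="display: inline">111</span>
   <div style="display:none">120</div>
   <span class="78">.</span>
   161
   <span class="vsP6">.</span>
   <span style="display:none">5</span>
   <span class="221">98</span>
</span>'''

# 'vAnH' is the class the <style> block sets to display:none.
print(''.join(extract_visible(sample, hidden_classes={'vAnH'})))  # 111.161.98
```

This still only understands inline `display:none` and an explicit list of hidden classes, and it assumes reasonably well-formed markup; anything that depends on real CSS evaluation needs a browser.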
BrtH