
I am using Python with BeautifulSoup4 and I need to retrieve visible links on the page. Given this code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
links = soup('a')  # shorthand for soup.find_all('a')

I would like to create a method is_visible that checks whether or not a link is displayed on the page.

Solution Using Selenium

Since I am also working with Selenium, I know that the following solution exists:

from selenium.webdriver import Firefox

firefox = Firefox()
firefox.get('https://google.com')
links = firefox.find_elements_by_tag_name('a')

for link in links:
    if link.is_displayed():
        print('{} => Visible'.format(link.text))
    else:
        print('{} => Hidden'.format(link.text))

firefox.quit()

Performance Issue

Unfortunately, the is_displayed method and reading the text attribute each perform an HTTP request to retrieve that information. Things can therefore get really slow when there are many links on a page or when you have to do this multiple times.

BeautifulSoup, on the other hand, can perform these parsing operations locally, with no network round trips, once you have the page source. But I can't figure out how to check visibility there.
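
One mitigation I can think of on the Selenium side, sketched below, is to batch all of the work into a single execute_script round trip. Note this is only a rough sketch: the JavaScript visibility test is jQuery's :visible heuristic (the element occupies layout space), which is not exactly what is_displayed checks.

from selenium.webdriver import Firefox

firefox = Firefox()
firefox.get('https://google.com')

# One round trip: the browser computes text and visibility for every link
# and returns a plain list of dicts to Python.
links = firefox.execute_script('''
    return Array.prototype.map.call(document.querySelectorAll('a'), function (a) {
        // jQuery's :visible heuristic: the element takes up space in the layout
        var visible = !!(a.offsetWidth || a.offsetHeight || a.getClientRects().length);
        return {text: a.textContent, visible: visible};
    });
''')

for link in links:
    print('{} => {}'.format(link['text'], 'Visible' if link['visible'] else 'Hidden'))

firefox.quit()

This still requires a browser, though, which is why a pure BeautifulSoup approach would be preferable.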

blueSurfer
  • I think the best you can do is to check the `style` attribute of the BeautifulSoup tag and parse the value to see if `display:none` or similar is in it. – Germano Mar 17 '14 at 12:21
  • Unfortunately, BeautifulSoup is an HTML parser, not a browser, so it knows nothing about how the page is rendered. I think you have to stick with Selenium. – fasouto Mar 17 '14 at 12:21
  • pyself, I'm pretty sure @fasouto is right. BeautifulSoup doesn't actually render anything, and if you read the Selenium documentation, it automates browsers, not just plain HTML. I think you'll have to stick with Selenium doing its thing if you really want to do this. – ddavison Mar 17 '14 at 12:23
  • Elements are hidden with inline, linked or internal CSS (`input` excepted). Or hidden with JS. Then you have other invisible stuff like white text on a white background. What exactly do you want to check? Only CSS `display:none`? Then you need to parse **all** stylesheets with *tinycss* and see if a rule matches the element. If you find a match, check which styles get applied. The difficulty is the cascading part. Also, if a parent is hidden the child is hidden, so you have to check whether all parents of that element are visible as well... Or just stick to Selenium. – allcaps Mar 17 '14 at 12:35
  • Take a look at this thread: http://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text – alecxe Mar 17 '14 at 13:04
  • I want to check only whether a link is displayed on the page, not whether it is visible from the user's perspective (i.e. white text on a white background). So the only cases are `display:none` and `type=hidden`. I think your solution should work but I am concerned about the performance of the cascading part. – blueSurfer Mar 17 '14 at 13:06
  • @pyself: Thinking out loud about the cascading part: if you collected all linked, internal and inline CSS (don't forget CSS linked with @import rules, and be careful with media rules), start matching all selectors *that hide content* against a copy of your soup. For each selector, select and remove the element from the soup... Now you can check an element for visibility by inspecting whether it is still available in the copied soup. If it's gone, it's hidden. Disclaimer: many pitfalls! Only a very rough indication and by far not the certainty of Selenium. (A rough sketch of this idea follows these comments.) – allcaps Mar 18 '14 at 11:18
  • How do you need to process this information? Could you at least collect all the links first and proceed with other processing? My reaction is that no matter what approach you take, this is probably the way to go if you are concerned about performance. There is always a trade-off between accuracy and performance, so decide which matters more. You can put everything into futures or some other concurrent structure you can access later when it is ready while proceeding happily, or queue it and process it in parallel in the background. You could also create bolts in Storm if you need something to happen continuously/in real time. – therewillbesnacks Mar 26 '14 at 12:30
  • Just to note that I don't believe the `is_displayed` function performs an HTTP request - once you have the page, that should be all that's needed. Certainly if I run something like you have here, say `for link in links: print(link.text, link.is_displayed())`, it only makes the one request and is very quick at the loop. – M Somerville Mar 26 '14 at 20:28
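
Putting the @allcaps comments into code, here is a rough sketch, assuming html and css_text are strings you have already fetched (collecting every linked and internal stylesheet is up to you). It uses tinycss to pull out the display:none selectors and BeautifulSoup's CSS selector support to mark the matched links; specificity, overriding rules, @media blocks, inline styles and JavaScript are all ignored, so the result is only an approximation and nothing like Selenium's certainty:

import tinycss
from bs4 import BeautifulSoup

def hiding_selectors(css_text):
    # Yield the selector of every plain rule set that declares display:none.
    stylesheet = tinycss.make_parser().parse_stylesheet(css_text)
    for rule in stylesheet.rules:
        if rule.at_keyword is not None:
            continue  # skip @media, @import, ... for simplicity
        for declaration in rule.declarations:
            if declaration.name == 'display' and declaration.value.as_css().strip() == 'none':
                yield rule.selector.as_css()

soup = BeautifulSoup(html, 'html.parser')

# A hidden parent hides all of its descendants, so mark every link found
# inside a matched element; this covers the cascading-parent case.
hidden_ids = set()
for selector in hiding_selectors(css_text):
    try:
        matches = soup.select(selector)
    except Exception:
        continue  # a selector bs4's CSS engine can't handle; skip it
    for element in matches:
        if element.name == 'a':
            hidden_ids.add(id(element))
        hidden_ids.update(id(a) for a in element.find_all('a'))

# id() is a safe key here because every tag comes from the same parsed tree.
for link in soup.find_all('a'):
    state = 'Hidden' if id(link) in hidden_ids else 'Visible'
    print('{} => {}'.format(link.get_text(strip=True), state))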

2 Answers


AFAIK, BeautifulSoup will only help you parse the actual markup of the HTML document anyway. If that's all you need, then you can do it like so (yes, I already know it's not perfect):

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')


def is_visible_1(link):
    # do whatever in this function you can to determine your markup is correct;
    # a link without any style attribute is treated as visible
    style = link.get('style') or ''
    if 'display' in style and 'none' in style:  # or use a regular expression
        return False
    return True

def is_visible_2(**kwargs):
    try:
        soup = kwargs.pop('soup')
        # an IndexError is raised if no element matches the kwargs
        link = soup.find_all(**kwargs)[0]
        style = link.get('style') or ''
        if 'display' in style and 'none' in style:  # or use a regular expression
            return False
    except Exception:
        return False
    return True


# checks links that already exist, not *if* they exist
for link in soup.find_all('a'):
    print(is_visible_1(link))

# checks if an element exists
print(is_visible_2(soup=soup, id='someID'))
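
For the "or use a regular expression" comments above, a hypothetical pattern that tolerates whitespace around the colon could look like this:

import re

# matches e.g. style="display:none" and style="display : none"
DISPLAY_NONE = re.compile(r'display\s*:\s*none')

def is_hidden_by_inline_style(link):
    return bool(DISPLAY_NONE.search(link.get('style') or ''))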

BeautifulSoup doesn't take into account the other parties that determine whether an element is visible, like CSS, scripts, and dynamic DOM changes. Selenium, on the other hand, tells you whether an element is actually being rendered, and generally does so through the accessibility APIs of the given browser. You must decide whether sacrificing accuracy for speed is worth it. Good luck! :-)


Try `find_elements_by_xpath` and `execute_script`: first block every link's default click action with JavaScript, then click each link. A hidden element raises an exception when clicked, which is what separates the two groups.

from selenium import webdriver

driver = webdriver.Chrome()

driver.get("https://www.google.com/?hl=en")
links = driver.find_elements_by_xpath('//a')

# Block every link's default click action so that the click() probe
# below never actually navigates away from the page.
driver.execute_script('''
    var links = document.querySelectorAll('a');

    links.forEach(function(a) {
        a.addEventListener("click", function(event) {
            event.preventDefault();
        });
    });
''')

visible = []
hidden = []
for link in links:
    try:
        # clicking a hidden element raises an exception,
        # which is the visibility signal here
        link.click()
        visible.append('{} => Visible'.format(link.text))
    except Exception:
        # .text is empty for hidden elements, so read textContent instead
        hidden.append('{} => Hidden'.format(link.get_attribute('textContent')))

    #time.sleep(0.1)

print('\n'.join(visible))
print('===============================')
print('\n'.join(hidden))
print('===============================\nTotal links length: %s' % len(links))

driver.execute_script('alert("Finish")')
ewwink