
I've been googling this all day without finding the answer, so apologies in advance if this has already been answered.

I'm trying to get all visible text from a large number of different websites. The reason is that I want to process the text to eventually categorize the websites.

After a couple of days of research, I decided that Selenium was my best bet. I've found a way to grab all the text with Selenium; unfortunately, the same text is being grabbed multiple times:

from selenium import webdriver
import codecs

filen = codecs.open('output.txt', encoding='utf-8', mode='w+')

driver = webdriver.Firefox()

driver.get("http://www.examplepage.com")

allelements = driver.find_elements_by_xpath("//*")

ferdigtxt = []

for i in allelements:
    if i.text in ferdigtxt:
        pass
    else:
        ferdigtxt.append(i.text)
        filen.writelines(i.text)

filen.close()

driver.quit()

The if condition inside the for loop is an attempt to eliminate the problem of fetching the same text multiple times; however, it only works as planned on some webpages. (It also makes the script a lot slower.)

I'm guessing the reason for my problem is that, when asking for the inner text of an element, I also get the inner text of all the elements nested inside it.
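To illustrate what I mean, here's a toy snippet (using the stdlib ElementTree in place of Selenium, purely for demonstration) showing how collecting the full text of every element repeats the nested text:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<div>outer <p>inner</p></div>")

# Collecting the full text of *every* element repeats the nested text:
texts = ["".join(el.itertext()) for el in root.iter()]
# the <div> yields "outer inner", then the nested <p> yields "inner" again
```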

Is there any way around this? Is there some sort of master element I grab the inner text of? Or a completely different way that would enable me to reach my goal? Any help would be greatly appreciated as I'm out of ideas for this one.

Edit: the reason I used Selenium and not Mechanize with Beautiful Soup is that I wanted JavaScript-rendered text

Code Jockey
Rookie
  • `lynx` and `w3c` can both do this via the CLI. – Blender Oct 30 '11 at 20:25
  • Shouldn't your xpath be something like `//body/text()`? – Pankrat Oct 30 '11 at 21:12
  • Your code seems obviously buggy: `for i in allelements: if i.allelements in ferdigtxt: pass ` If `i` is in `allelements`, then `i.allelements` is probably a bug. – Dimitre Novatchev Oct 31 '11 at 00:06
  • Another observation is that you seem to compare whole text nodes between themselves and this comparison will probably be false in almost 100% of the cases. If you actually want to compare the words used, then @unutbu 's solution provides this. Please, edit your question and clearly define the problem. – Dimitre Novatchev Oct 31 '11 at 00:09
  • @Blender: do `lynx` and `w3c` support javascript? (I doubt it). – jfs Oct 31 '11 at 10:03
  • @Dimitre Novatchev - the bug in the code came after I gave the variables English-comprehensible names (translated from Norwegian, although I missed one). My girlfriend was unfortunately a tad upset that I was still on the computer (we had a movie night scheduled :), so I had to write in a hurry - and I made a stupid mistake. My apologies to everyone for that. – Rookie Oct 31 '11 at 11:09

2 Answers


Using lxml, you might try something like this:

import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
import lxml.html.clean as clean

url="http://www.yahoo.com"
ignore_tags=('script','noscript','style')
with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url) # Load page
    content=browser.page_source
    cleaner=clean.Cleaner()
    content=cleaner.clean_html(content)    
    with open('/tmp/source.html','w') as f:
        f.write(content.encode('utf-8'))
    doc=LH.fromstring(content)
    with open('/tmp/result.txt','w') as f:
        for elt in doc.iterdescendants():
            if elt.tag in ignore_tags: continue
            text=elt.text or ''
            tail=elt.tail or ''
            words=' '.join((text,tail)).strip()
            if words:
                words=words.encode('utf-8')
                f.write(words+'\n') 

This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes with time (done with javascript and refresh perhaps).
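The key idea above, emitting only each element's own `.text` and `.tail`, visits every piece of text exactly once, so nothing is duplicated. A toy demonstration (stdlib ElementTree stands in for lxml here; both expose `.text`/`.tail` the same way):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<div>outer <p>inner</p> tail</div>")

# Each text fragment belongs to exactly one element's .text or .tail,
# so walking the tree and emitting only those yields no duplicates:
pieces = []
for el in root.iter():
    for chunk in (el.text, el.tail):
        if chunk and chunk.strip():
            pieces.append(chunk.strip())
# pieces holds each fragment exactly once: "outer", "inner", "tail"
```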

unutbu
  • Thank you so much for that thorough answer unutbu! You've used a lot of code that I'm unfamiliar with, so it will be exciting to read up on your solution. I'm so sorry I didn't specify this earlier - but the reason I was using Selenium was to ensure that I could get the JavaScript-rendered text - as I understand it, your solution does not offer that. That being said, if I do not find a way to grab both HTML and JavaScript-rendered text, I will definitely give your solution a try. So thank you so much again! – Rookie Oct 31 '11 at 11:03
  • The code posted above uses Selenium's webdriver, so it will contain javascript rendered text. If you visit yahoo.com from a browser, however, you'll see a region at the top of the page that changes with time or when your mouse hovers over certain images. I noticed that the code above does not capture all the text possible from that region. I'm not sure of the best way to programmatically fix this (reload the page many times? yuck...). Other than that, it should work with most websites. – unutbu Oct 31 '11 at 13:25
  • Wow, that was great to hear! Thank you so much unutbu - I will be diving in to your code as soon as I get of work :) – Rookie Oct 31 '11 at 15:06

Here's a variation on @unutbu's answer:

#!/usr/bin/env python
import sys
from contextlib import closing

import lxml.html as html # pip install 'lxml>=2.3.1'
from lxml.html.clean        import Cleaner
from selenium.webdriver     import Firefox         # pip install selenium
from werkzeug.contrib.cache import FileSystemCache # pip install werkzeug

cache = FileSystemCache('.cachedir', threshold=100000)

url = sys.argv[1] if len(sys.argv) > 1 else "https://stackoverflow.com/q/7947579"


# get page
page_source = cache.get(url)
if page_source is None:
    # use firefox to get page with javascript generated content
    with closing(Firefox()) as browser:
        browser.get(url)
        page_source = browser.page_source
    cache.set(url, page_source, timeout=60*60*24*7) # week in seconds


# extract text
root = html.document_fromstring(page_source)
# remove flash, images, <script>,<style>, etc
Cleaner(kill_tags=['noscript'], style=True)(root) # lxml >= 2.3.1
print root.text_content() # extract text

I've separated your task into two:

  • get page (including elements generated by javascript)
  • extract text

The two steps are connected only through the cache. You can fetch pages in one process and extract the text in another, or defer extraction and do it later using a different algorithm.
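If werkzeug isn't available, the same decoupling can be sketched with a hand-rolled file cache (`cache_set`/`cache_get` are hypothetical helper names, not werkzeug's API; one file per URL, keyed by a hash):

```python
import hashlib
import tempfile
from pathlib import Path

# stand-in for FileSystemCache: a directory holding one file per URL
cache_dir = Path(tempfile.mkdtemp())

def _key(url):
    # stable filename derived from the URL
    return cache_dir / hashlib.sha1(url.encode("utf-8")).hexdigest()

def cache_set(url, page_source):
    _key(url).write_text(page_source, encoding="utf-8")

def cache_get(url):
    path = _key(url)
    return path.read_text(encoding="utf-8") if path.exists() else None

# the fetching process stores the page; the extracting process reads it back
cache_set("http://example.com", "<html>hi</html>")
page = cache_get("http://example.com")
```

Any process that can see `cache_dir` can then run the extraction step independently of the browser.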

jfs