
I've been googling this all day without finding the answer, so apologies in advance if this has already been answered.

I'm trying to get all visible text from a large number of different websites. The reason is that I want to process the text to eventually categorize the websites.

After a couple of days of research, I decided that Selenium was my best bet. I've found a way to grab all the text with Selenium; unfortunately, the same text is being grabbed multiple times:

from selenium import webdriver
import codecs

filen = codecs.open('output.txt', encoding='utf-8', mode='w+')

driver = webdriver.Firefox()

driver.get("http://www.examplepage.com")

allelements = driver.find_elements_by_xpath("//*")

ferdigtxt = []

for i in allelements:
    if i.text in ferdigtxt:
        pass
    else:
        ferdigtxt.append(i.text)
        filen.writelines(i.text)

filen.close()

driver.quit()

The if condition inside the for loop is an attempt to eliminate the problem of fetching the same text multiple times; however, it only works as planned on some webpages. (It also makes the script a lot slower.)

I'm guessing the reason for my problem is that, when asking for the inner text of an element, I also get the inner text of all the elements nested inside it.
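To illustrate what I mean, here's a toy snippet (using the stdlib ElementTree in place of Selenium, purely for demonstration) showing how collecting the full text of every element repeats the nested text:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<div>outer <p>inner</p></div>")

# Collecting the full text of *every* element repeats the nested text:
texts = ["".join(el.itertext()) for el in root.iter()]
# the <div> yields "outer inner", then the nested <p> yields "inner" again
```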

Is there any way around this? Is there some sort of master element I grab the inner text of? Or a completely different way that would enable me to reach my goal? Any help would be greatly appreciated as I'm out of ideas for this one.

Edit: the reason I used Selenium and not Mechanize with Beautiful Soup is that I wanted JavaScript-rendered text

Code Jockey
Rookie
  • `lynx` and `w3c` can both do this via the CLI. – Blender Oct 30 '11 at 20:25
  • Shouldn't your xpath be something like `//body/text()`? – Pankrat Oct 30 '11 at 21:12
  • Your code seems obviously buggy: `for i in allelements: if i.allelements in ferdigtxt: pass ` If `i` is in `allelements`, then `i.allelements` is probably a bug. – Dimitre Novatchev Oct 31 '11 at 00:06
  • Another observation is that you seem to compare whole text nodes between themselves and this comparison will probably be false in almost 100% of the cases. If you actually want to compare the words used, then @unutbu 's solution provides this. Please, edit your question and clearly define the problem. – Dimitre Novatchev Oct 31 '11 at 00:09
  • @Blender: do `lynx` and `w3c` support javascript? (I doubt it). – jfs Oct 31 '11 at 10:03
  • @Dimitre Novatchev - the bug in the code came after I gave the variables English-comprehensible names (translated from Norwegian, although I missed one). My girlfriend was unfortunately a tad upset that I was still on the computer (we had a movie night scheduled :), so I had to write in a hurry - and I made a stupid mistake. My apologies to everyone for that. – Rookie Oct 31 '11 at 11:09

2 Answers


Using lxml, you might try something like this:

import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
import lxml.html.clean as clean

url="http://www.yahoo.com"
ignore_tags=('script','noscript','style')
with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url) # Load page
    content=browser.page_source
    cleaner=clean.Cleaner()
    content=cleaner.clean_html(content)    
    with open('/tmp/source.html','w') as f:
        f.write(content.encode('utf-8'))
    doc=LH.fromstring(content)
    with open('/tmp/result.txt','w') as f:
        for elt in doc.iterdescendants():
            if elt.tag in ignore_tags: continue
            text=elt.text or ''
            tail=elt.tail or ''
            words=' '.join((text,tail)).strip()
            if words:
                words=words.encode('utf-8')
                f.write(words+'\n') 

This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes with time (done with javascript and refresh perhaps).
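The key idea above, emitting only each element's own `.text` and `.tail`, visits every piece of text exactly once, so nothing is duplicated. A toy demonstration (stdlib ElementTree stands in for lxml here; both expose `.text`/`.tail` the same way):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<div>outer <p>inner</p> tail</div>")

# Each text fragment belongs to exactly one element's .text or .tail,
# so walking the tree and emitting only those yields no duplicates:
pieces = []
for el in root.iter():
    for chunk in (el.text, el.tail):
        if chunk and chunk.strip():
            pieces.append(chunk.strip())
# pieces holds each fragment exactly once: "outer", "inner", "tail"
```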

unutbu
  • Thank you so much for that thorough answer unutbu! You've used a lot of code that I'm unfamiliar with, so it will be exciting to read up on your solution. I'm so sorry I didn't specify this earlier - but the reason I was using Selenium was to ensure that I could get the JavaScript-rendered text - as I understand it, your solution does not offer that. That being said, if I do not find a way to grab both HTML and JavaScript-rendered text, I will definitely give your solution a try. So thank you so much again! – Rookie Oct 31 '11 at 11:03
  • The code posted above uses Selenium's webdriver, so it will contain javascript rendered text. If you visit yahoo.com from a browser, however, you'll see a region at the top of the page that changes with time or when your mouse hovers over certain images. I noticed that the code above does not capture all the text possible from that region. I'm not sure of the best way to programmatically fix this (reload the page many times? yuck...). Other than that, it should work with most websites. – unutbu Oct 31 '11 at 13:25
  • Wow, that was great to hear! Thank you so much unutbu - I will be diving in to your code as soon as I get of work :) – Rookie Oct 31 '11 at 15:06

Here's a variation on @unutbu's answer:

#!/usr/bin/env python
import sys
from contextlib import closing

import lxml.html as html # pip install 'lxml>=2.3.1'
from lxml.html.clean        import Cleaner
from selenium.webdriver     import Firefox         # pip install selenium
from werkzeug.contrib.cache import FileSystemCache # pip install werkzeug

cache = FileSystemCache('.cachedir', threshold=100000)

url = sys.argv[1] if len(sys.argv) > 1 else "https://stackoverflow.com/q/7947579"


# get page
page_source = cache.get(url)
if page_source is None:
    # use firefox to get page with javascript generated content
    with closing(Firefox()) as browser:
        browser.get(url)
        page_source = browser.page_source
    cache.set(url, page_source, timeout=60*60*24*7) # week in seconds


# extract text
root = html.document_fromstring(page_source)
# remove flash, images, <script>,<style>, etc
Cleaner(kill_tags=['noscript'], style=True)(root) # lxml >= 2.3.1
print root.text_content() # extract text

I've separated your task into two:

  • get page (including elements generated by javascript)
  • extract text

The two steps are connected only through the cache. You can fetch pages in one process and extract the text in another, or defer extraction and do it later using a different algorithm.
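If werkzeug isn't available, the same decoupling can be sketched with a hand-rolled file cache (`cache_set`/`cache_get` are hypothetical helper names, not werkzeug's API; one file per URL, keyed by a hash):

```python
import hashlib
import tempfile
from pathlib import Path

# stand-in for FileSystemCache: a directory holding one file per URL
cache_dir = Path(tempfile.mkdtemp())

def _key(url):
    # stable filename derived from the URL
    return cache_dir / hashlib.sha1(url.encode("utf-8")).hexdigest()

def cache_set(url, page_source):
    _key(url).write_text(page_source, encoding="utf-8")

def cache_get(url):
    path = _key(url)
    return path.read_text(encoding="utf-8") if path.exists() else None

# the fetching process stores the page; the extracting process reads it back
cache_set("http://example.com", "<html>hi</html>")
page = cache_get("http://example.com")
```

Any process that can see `cache_dir` can then run the extraction step independently of the browser.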

jfs