I would like to scrape all English words from, say, New York Times front page. I wrote something like this in Python:
import re
from urllib import FancyURLopener
class MyOpener(FancyURLopener):
version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
opener = MyOpener()
url = "http://www.nytimes.com"
h = opener.open(url)
content = h.read()
tokens = re.findall("\s*(\w*)\s*", content, re.UNICODE)
print tokens
This works okay, but I get HTML keywords such as "img", "src" as well as English words. Is there a simple way to get only English words from Web scaping / HTML ?
I saw this post, it only seems to talk about the mechanics of scraping, none of the tools mentioned talk about how to filter out non-language elements. I am not interested in links, formatting, etc. Just plain words. Any help would be appreciated.