
I would like to scrape all English words from, say, the New York Times front page. I wrote something like this in Python:

import re
from urllib import FancyURLopener

class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'            

opener = MyOpener()
url = "http://www.nytimes.com"
h = opener.open(url)
content = h.read()
tokens = re.findall("\s*(\w*)\s*", content, re.UNICODE) 
print tokens

This works okay, but I also get HTML keywords such as "img" and "src" along with the English words. Is there a simple way to get only English words out of web scraping / HTML?

I saw this post, but it only seems to cover the mechanics of scraping; none of the tools mentioned explain how to filter out non-language elements. I am not interested in links, formatting, etc., just plain words. Any help would be appreciated.

BBSysDyn

5 Answers


Are you sure you want "English" words -- in the sense that they appear in some dictionary? For example, if you scraped an NYT article, would you want to include "Obama" (or "Palin" for you Blue-Staters out there), even though they probably don't appear in any dictionaries yet?

Better, in many cases, to parse the HTML (using BeautifulSoup as Bryan suggests) and include only the text-nodes (and maybe some aimed-at-humans attributes like "title" and "alt").
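The text-node idea can be sketched with only the standard library (a minimal Python 3 sketch; BeautifulSoup's get_text() does the same job more robustly, and the sample HTML here is made up):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only text nodes, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

def visible_text(html):
    p = TextExtractor()
    p.feed(html)
    return ' '.join(p.parts)

print(visible_text('<p>Hello <b>world</b></p><script>var x=1;</script>'))
```

This keeps "Hello" and "world" but drops the JavaScript, which is exactly the distinction a plain regex over the raw HTML cannot make.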

Michael Lorton
  • I just realized this has been answered, using BeautifulSoup and NYTimes as an example, even here: http://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text – BBSysDyn Jun 21 '11 at 00:24
  • Yes, I'd want Obama and Palin; basically I would want all visible words, not simply "English" words. Sorry for the confusion. I would not want dictionary lookups, as I might use this code for other languages as well. – BBSysDyn Jun 21 '11 at 00:24

You would need some sort of English dictionary reference. A simple way of doing this would be to use a spellchecker. PyEnchant comes to mind.

From the PyEnchant website:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>>

In your case, perhaps something along the lines of:

d = enchant.Dict("en_US")
english_words = [tok for tok in tokens if d.check(tok)]

If that's not enough and you don't want "English words" that may appear in an HTML tag (such as an attribute) you could probably use BeautifulSoup to parse out only the important text.

Bryan

Html2Text can be a good option.

import html2text

print html2text.html2text(your_html_string)

Yajushi

I love using the lxml library for this:

# adapted from http://lxml.de/lxmlhtml.html#examples
import urllib
from lxml.html import fromstring

url = 'http://microformats.org/'
content = urllib.urlopen(url).read()
doc = fromstring(content)
print doc.text_content()  # all the document's text nodes, tags stripped

Then, to ensure the scraped words are only English words, you could look them up in a dictionary loaded from a text file, or use NLTK, which comes with many corpora and language-processing tools.
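The dictionary-lookup step can be sketched like this (Python 3; the word set here is a hypothetical stand-in — in practice you would load something like /usr/share/dict/words or an NLTK corpus such as nltk.corpus.words):

```python
# Hypothetical mini-dictionary standing in for a real word list.
english = {"the", "president", "spoke", "today"}

tokens = ["the", "president", "img", "src", "spoke", "today"]

# Keep only tokens found in the word list (case-insensitive).
english_only = [t for t in tokens if t.lower() in english]
print(english_only)  # ['the', 'president', 'spoke', 'today']
```

Note that, per the asker's comment, a dictionary filter would also drop visible proper nouns like "Obama", so whether this step is wanted depends on the goal.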

Robert

You can replace all matches of <.*?> with nothing or a space. Use the re module, and make sure you understand greedy versus non-greedy pattern matching: you need non-greedy here, because the greedy <.*> would swallow everything from the first < to the last > in a single match.

Then once you have stripped all the tags, apply the strategy you were using.

Nickle