Extracting only meaningful text from webpages

Question

I have a list of URLs that I scrape and tokenize with nltk. The end result is a list of all the words on each webpage. The trouble is that I am only looking for keywords and phrases, not the usual English "sugar" words such as "as, and, like, to, am, for", etc. I know I can build a file of common English words and remove them from my list of scraped tokens, but is there a built-in feature in some library that does this automatically?

I am essentially looking for useful words on a page that are not fluff and can give some context to what the page is about. Almost like the tags on Stack Overflow or the tags Google uses for SEO.

possible duplicate of [How to remove stop words using nltk or python](http://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python) — tripleee, Apr 04 '14 at 04:06

score 3 · Accepted Answer · edited May 23 '17 at 11:54

I think what you are looking for is the stopwords.words from nltk.corpus:

>>> from nltk.corpus import stopwords
>>> sw = set(stopwords.words('english'))
>>> sentence = "a long sentence that contains a for instance"
>>> [w for w in sentence.split() if w not in sw]
['long', 'sentence', 'contains', 'instance']
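
Scraped text usually contains capital letters and punctuation, which a plain `split()` leaves attached to the words, so "The" or "for," would slip past the filter. The sketch below shows one way to normalize tokens before filtering. It uses only the standard library so it runs without the NLTK data; `STOP_WORDS` is a tiny hand-written stand-in for `set(stopwords.words('english'))`, and `content_words` is a hypothetical helper name, not part of NLTK.

```python
import re

# Tiny stand-in stop list for illustration only; in practice use
# set(stopwords.words('english')) from nltk.corpus as shown above.
STOP_WORDS = {"a", "an", "and", "as", "for", "is", "that", "the", "to"}

def content_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, extract alphabetic tokens, drop stop words."""
    # Words made of letters, optionally with one internal apostrophe (e.g. "don't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [t for t in tokens if t not in stop_words]

print(content_words("The parser is fast, and that is a bonus for scrapers."))
# ['parser', 'fast', 'bonus', 'scrapers']
```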

Edit: searching for "stopword" turns up possible duplicates: Stopword removal with NLTK, How to remove stop words using nltk or python. See the answers to these questions, and consider Effects of Stemming on the term frequency? too.

score 1 · Answer 2 · answered Apr 04 '14 at 16:46

While you can get robust lists of stop words from NLTK (and elsewhere), you can easily build your own list according to the kind of data (register) you process. Most of the words you do not want are so-called grammatical words: they are extremely frequent, so you can catch them easily by sorting a frequency list in descending order and discarding the top n items.

In my experience, the first 100 ranks of any moderately large corpus (>10k tokens of running text) hardly contain any content words.
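
The frequency-based approach described above can be sketched with `collections.Counter`. This is a minimal illustration, assuming `tokens` holds the running text of the whole collection of pages rather than a single page; the function names and the default `n=100` (taken from the rule of thumb above) are illustrative choices.

```python
from collections import Counter

def frequency_stop_list(tokens, n=100):
    """Return the n most frequent tokens as a candidate stop list."""
    counts = Counter(t.lower() for t in tokens)
    return {word for word, _ in counts.most_common(n)}

def drop_frequent(tokens, n=100):
    """Remove every token that belongs to the top-n frequency ranks."""
    stop = frequency_stop_list(tokens, n)
    return [t for t in tokens if t.lower() not in stop]

# Toy corpus: far too small for n=100, so n=2 is used for the demo.
tokens = "the cat and the dog and the bird and the fish".split()
print(drop_frequent(tokens, n=2))
# ['cat', 'dog', 'bird', 'fish']
```

On a corpus this small the cutoff has to be tiny; with real scraped text it is worth inspecting the resulting list before relying on it, since a frequent content word can occasionally land in the top ranks.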

It seems that you are interested in extracting keywords, however. For this task, raw frequencies are not very useful. You will need to transform the frequencies into some other value with respect to a reference corpus: this is called weighting, and there are many different ways to achieve it. TF-IDF, whose inverse document frequency component dates back to 1972, is the standard choice.
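
To make the weighting idea concrete, here is a minimal TF-IDF sketch using only the standard library. It assumes the scraped pages themselves serve as the reference corpus, and it uses one common variant (term frequency normalized by document length, unsmoothed logarithmic IDF); library implementations often use smoothed variants, so their numbers will differ. The names `tf_idf` and `top_keywords` are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

def top_keywords(docs, index, k=3):
    """Return the k highest-weighted terms of the document at position index."""
    scores = tf_idf(docs)[index]
    return sorted(scores, key=scores.get, reverse=True)[:k]

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
print(top_keywords(docs, 0, k=1))
# ['mat']
```

A word such as "the" that occurs in every document gets a weight of zero, so stop words drop out of the ranking without a separate list, while a word unique to one page ("mat") rises to the top.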

If you are going to spend time doing these tasks, get an introductory handbook for corpus linguistics or computational linguistics.

score 0 · Answer 3 · edited Apr 13 '17 at 12:54

You can look at available linguistic corpora for data on word frequency (along with other annotations).

You can start from the links on Wikipedia: http://en.wikipedia.org/wiki/Corpus_linguistics#External_links

You can probably find more information at https://linguistics.stackexchange.com/


answered Apr 03 '14 at 21:02

m.wasowski
