Stopword removal using Python

Question

All,

I have some text that I need to clean up and I have a little algorithm that "mostly" works.

def removeStopwords(self, data):
    with open(r'stopwords.txt') as stopwords:
        wordList = []
        for i in stopwords:
            wordList.append(i.strip())
        charList = list(data)
        cat = ''.join(char for char in charList if not char in wordList).split()
        return ' '.join(cat)

Take the first line on this page, http://en.wikipedia.org/wiki/Paragraph, and remove all the characters that we are not interested in, which in this case are all the non-alphanumeric characters.

A paragraph (from the Greek paragraphos, "to write beside" or "written beside") is a self-contained unit of a discourse in writing dealing with a particular point or idea. A paragraph consists of one or more sentences.[1][2] The start of a paragraph is indicated by beginning on a new line. Sometimes the first line is indented. At various times, the beginning of a paragraph has been indicated by the pilcrow: ¶.

The output looks pretty good except that some of the words are recombined incorrectly and I am unsure how to correct it.

A paragraph from the Greek paragraphos to write beside or written beside is a selfcontained unit

Note that the word "selfcontained" was originally "self-contained".

EDIT: Contents of the stopwords file, which is just a bunch of characters:

! $ % ^ , & * ( ) { } [ ] <

, . / | \ ? ~ ` : ; "

It turns out I don't need a list of words at all, because I was only really trying to remove characters, which in this case were punctuation marks.

        cat = ''.join(data.translate(None, string.punctuation)).split()
        print ' '.join(cat).lower()
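Deleting punctuation outright is what fuses "self-contained" into "selfcontained". A minimal sketch of one way around that, assuming Python 3 (where `str.translate` takes a single table built with `str.maketrans`) and assuming punctuation should act as a word separator. The function name `strip_punctuation` is hypothetical.

```python
import string

# Assumption: every character in string.punctuation is treated as a separator,
# so each one is mapped to a space instead of being deleted.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def strip_punctuation(text):
    """Replace ASCII punctuation with spaces, collapse whitespace, lowercase."""
    return ' '.join(text.translate(_PUNCT_TO_SPACE).split()).lower()

print(strip_punctuation('is a self-contained unit (of a discourse).'))
# is a self contained unit of a discourse
```

Note that `string.punctuation` covers ASCII only, so a symbol such as the pilcrow (¶) passes through unchanged and would need to be added to the table separately.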

What are the contents of stopwords.txt? A list of punctuation symbols, rather than, well, stop words? – Wooble, Feb 22 '12 at 19:46
I usually think of "stopword removal" as removing actual words (such as "of" or "the"), but it seems like what you're trying to do here is actually remove specific characters (e.g. to strip punctuation). Is that correct? – Edward Loper, Feb 22 '12 at 19:50
@Wooble stopwords is full of non-alphanumeric characters, that is, everything other than letters and numbers. – aeupinhere, Feb 22 '12 at 19:53
@AdamEstrada: Edit your question to include that information. – John Machin, Feb 22 '12 at 20:05
The question invites you to remove **all** the non-alphanumeric chars. You are not removing spaces. – John Machin, Feb 22 '12 at 20:10
@EdwardLoper I needed to remove punctuation characters and real stopwords. I figured the list was the best approach at the time, but now I am using a combination of the two: http://pastebin.com/rVxvhuBi My stopword list is very close to this one: http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop – aeupinhere, Feb 23 '12 at 19:52

score 2 · Answer 1 · edited May 23 '17 at 12:14

In Python 2.x:

line = 'hello!'
line.translate(None, '!$%') #'hello'
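The two-argument form above exists only in Python 2. As a hedged sketch of the equivalent in Python 3, the characters to delete go in the third argument of `str.maketrans`; the variable name `cleaned` is only for illustration.

```python
# Python 3: the third argument to str.maketrans lists characters to delete.
line = 'hello!'
cleaned = line.translate(str.maketrans('', '', '!$%'))
print(cleaned)  # hello
```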

answered Feb 22 '12 at 19:45 by Fred; edited May 23 '17 at 12:14 by Community

+1 Ignore the anonymous downvoter. `str.translate` is the way to go. Maybe change your example to show removing non-alphanumeric chars. – John Machin Feb 22 '12 at 20:15

score 1 · Accepted Answer · edited Sep 19 '13 at 00:01

Load your stopwords/stopchars in a separate function.

Don't hard-code file names/paths.

Your wordList should be a set, not a list.

However if you are working with chars, not words, investigate str.translate.
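The advice above could be sketched roughly as follows, assuming Python 3 and a stopword file with one entry per line. The names `load_stopwords` and `remove_stopwords` are hypothetical, and the case-insensitive matching is an assumption rather than something the question specifies.

```python
def load_stopwords(path):
    """Read one stopword per line from `path` and return them as a set."""
    with open(path, encoding='utf-8') as handle:
        # A set gives constant-time membership tests; blank lines are skipped.
        return {line.strip() for line in handle if line.strip()}

def remove_stopwords(text, stopwords):
    """Drop whitespace-separated tokens whose lowercase form is in `stopwords`."""
    return ' '.join(word for word in text.split() if word.lower() not in stopwords)

# Usage with an in-memory set instead of a file:
print(remove_stopwords('A paragraph is a unit of a discourse', {'a', 'is', 'of'}))
# paragraph unit discourse
```

Passing the path in as an argument keeps the file name out of the cleaning logic, and loading the set once means it can be reused across many calls.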

answered Feb 22 '12 at 20:00 by John Machin; edited Sep 19 '13 at 00:01 by Kara

Nope...not HW and I need to remove/replace these characters from my data in order to build Jaccard Indices on them. – aeupinhere Feb 22 '12 at 20:34

score -2 · Answer 3 · answered Feb 22 '12 at 19:51

One way to go would be to use the replace method and have an exhaustive list of characters you don't want.

For example:

c = ['a', 'h']
a = 'john'
for item in c:
    a = a.replace(item, '')
    print a

This prints john and then jon, because the print statement runs once per loop iteration.

answered Feb 22 '12 at 19:51 by James R

Interesting. I get the same results when doing it this way. for item in wordList: data = data.replace(item,'') print data – aeupinhere Feb 22 '12 at 20:00
