0

All,

I have some text that I need to clean up and I have a little algorithm that "mostly" works.

def removeStopwords(self, data):
    with open(r'stopwords.txt') as stopwords:
        wordList = []
        for i in stopwords:
            wordList.append(i.strip())
        charList = list(data)
        cat = ''.join(char for char in charList if not char in wordList).split()
        return ' '.join(cat)

Take the first paragraph on this page, http://en.wikipedia.org/wiki/Paragraph, and remove all the characters that we are not interested in, which in this case are all the non-alphanumeric chars.

A paragraph (from the Greek paragraphos, "to write beside" or "written beside") is a self-contained unit of a discourse in writing dealing with a particular point or idea. A paragraph consists of one or more sentences.[1][2] The start of a paragraph is indicated by beginning on a new line. Sometimes the first line is indented. At various times, the beginning of a paragraph has been indicated by the pilcrow: ¶.

The output looks pretty good except that some of the words are recombined incorrectly and I am unsure how to correct it.

A paragraph from the Greek paragraphos to write beside or written beside is a selfcontained unit

Note the word "selfcontained" was "self-contained".
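The halves of "self-contained" join because the hyphen is deleted outright, so nothing is left to separate them. One possible fix is to replace each punctuation character with a space rather than removing it. The sketch below assumes Python 3 and uses `string.punctuation` as the character list; `replace_punctuation` is a hypothetical name, not part of the original code. Non-ASCII marks such as the pilcrow are not in `string.punctuation` and would pass through unchanged.

```python
import string

# Map every ASCII punctuation character to a space instead of deleting it.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def replace_punctuation(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return " ".join(text.translate(PUNCT_TO_SPACE).split())


print(replace_punctuation('a self-contained unit ("written beside")'))
# a self contained unit written beside
```

Whether "self contained" or "selfcontained" is the better result depends on how the tokens are compared afterwards.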

EDIT: Contents of the stopwords file, which is just a bunch of chars:

! $ % ^ , & * ( ) { } [ ] <

, . / | \ ? ~ ` : ; "

Turns out I don't need a list of words at all, because I was only really trying to remove characters, which in this case were punctuation marks.

import string  # provides string.punctuation

# Python 2: str.translate(None, chars) deletes every character in chars.
cat = data.translate(None, string.punctuation).split()
print ' '.join(cat).lower()
aeupinhere
  • What are the contents of stopwords.txt? A list of punctuation symbols, rather than, well, stop words? – Wooble Feb 22 '12 at 19:46
  • I usually think of "stopword removal" as removing actual words (such as "of" or "the"); but it seems like what you're trying to do here is actually remove specific characters (e.g. to strip punctuation). Is that correct? – Edward Loper Feb 22 '12 at 19:50
  • @Wooble stopwords is full of non-alphanumeric characters or everything other than letters and numbers. – aeupinhere Feb 22 '12 at 19:53
  • @AdamEstrada: Edit your question to include that information. – John Machin Feb 22 '12 at 20:05
  • The question invites you to remove **all** the non-alphanumeric chars. You are not removing spaces. – John Machin Feb 22 '12 at 20:10
  • @EdwardLoper I needed to remove both punctuation characters and real stopwords. I figured the list was the best approach at the time, but now I am using a combo of the two. http://pastebin.com/rVxvhuBi My stopword list is very close to this one. http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop – aeupinhere Feb 23 '12 at 19:52
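As a rough illustration of combining the two steps mentioned in the last comment (this is not the code from the pastebin link), punctuation can be deleted first and the remaining words filtered against a set of stopwords. The sketch assumes Python 3; `clean` and the tiny `STOPWORDS` set are hypothetical stand-ins for a real list loaded from a file.

```python
import string

# Hypothetical, tiny stopword set for illustration only; a real list
# would normally be loaded from a file.
STOPWORDS = {"a", "is", "of", "or", "the", "to"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean(text, stopwords=STOPWORDS):
    """Lowercase the text, delete punctuation, then drop stopwords."""
    words = text.lower().translate(PUNCT_TABLE).split()
    return " ".join(w for w in words if w not in stopwords)


print(clean("A paragraph is a self-contained unit of a discourse."))
# paragraph selfcontained unit discourse
```

Keeping the stopwords in a set makes each membership test a constant-time lookup.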

3 Answers

2

In Python 2.x:

line = 'hello!'
line.translate(None, '!$%') #'hello'
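The two-argument form above only works on Python 2 `str` objects. For comparison, a minimal Python 3 sketch of the same deletion builds a table with `str.maketrans`, whose third argument lists the characters to delete:

```python
line = 'hello!'

# Python 3: the third argument to str.maketrans names the characters to delete.
table = str.maketrans('', '', '!$%')

print(line.translate(table))  # hello
```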

Fred
  • +1 Ignore the anonymous downvoter. `str.translate` is the way to go. Maybe change your example to show removing non-alphanumeric chars. – John Machin Feb 22 '12 at 20:15
1

Load your stopwords/stopchars in a separate function.

Don't hard-code file names/paths.

Your wordList should be a set, not a list.

However, if you are working with chars, not words, investigate `str.translate`.
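A minimal sketch of how these suggestions might fit together, assuming Python 3 and a file of whitespace-separated characters like the one shown in the question. The names `load_stopchars` and `remove_chars` are hypothetical.

```python
def load_stopchars(path):
    """Read whitespace-separated entries from the file at path into a set."""
    with open(path, encoding="utf-8") as handle:
        return set(handle.read().split())


def remove_chars(text, stopchars):
    """Delete every character found in stopchars, then normalise whitespace."""
    # str.maketrans treats the third argument as a string of characters
    # to delete, so multi-character entries are deleted char by char.
    table = str.maketrans("", "", "".join(stopchars))
    return " ".join(text.translate(table).split())


# Usage with an in-memory set; load_stopchars(path) would supply it from a file.
print(remove_chars('hello, "world"!', {",", '"', "!"}))  # hello world
```

Passing the path as an argument keeps the file name out of the cleaning function, and the set removes duplicate entries such as the comma that appears twice in the question's list.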

Kara
John Machin
  • Nope...not HW and I need to remove/replace these characters from my data in order to build Jaccard Indices on them. – aeupinhere Feb 22 '12 at 20:34
-2

One way to go would be to use the replace method and have an exhaustive list of characters you don't want.

For example:

c = ['a', 'h']
a = 'john'
for item in c:
    a = a.replace(item, '')
    print a

This prints "john" (there is no 'a' to remove) and then "jon".

James R
  • Interesting. I get the same results when doing it this way. for item in wordList: data = data.replace(item,'') print data – aeupinhere Feb 22 '12 at 20:00