4

I am having trouble creating code which removes stop words from a string input. Currently, here is my code:

stopWords = [ "a", "i", "it", "am", "at", "on", "in", "to", "too", "very", \
                 "of", "from", "here", "even", "the", "but", "and", "is", "my", \
                 "them", "then", "this", "that", "than", "though", "so", "are" ]
stemEndings = [ "-s", "-es", "-ed", "-er", "-ly", "-ing", "-'s", "-s'" ]
punctuation = [ ".", ",", ":", ";", "!", "?" ]
line = raw_input ("Type in lines, finish with a . at start of line only:")
while line != ".":
    def remove_punctuation(input): #removes punctuation from input
        output = ""
        text= 0
        while text<=(len(input)-1) :
            if input[text] not in punctuation:
               output=output + input[text]
            text+=1
        return output
    newline= remove_punctuation(line)
    newline= newline.lower()

What code could be added to remove stopWords from a string based on the stopWords list above? Thank you in advance.

user3052287

4 Answers

3

As I understand your problem, you want to remove punctuation from an input string. Here is my variant of the remove_punctuation function:

def remove_punctuation(input_string):
    for item in punctuation:
        input_string = input_string.replace(item, '')
    return input_string
greg
3

As greg suggested, you should use a for loop instead of a while loop, because it is more Pythonic and makes the code easier to understand. You should also define the function before the while loop that reads the input, so that the Python interpreter does not redefine it on every iteration.

If you want, you can also make punctuation a string rather than a list, which is easier to read and write:

stopWords = [ "a", "i", "it", "am", "at", "on", "in", "to", "too", "very", \
              "of", "from", "here", "even", "the", "but", "and", "is", "my", \
              "them", "then", "this", "that", "than", "though", "so", "are" ]
stemEndings = [ "-s", "-es", "-ed", "-er", "-ly", "-ing", "-'s", "-s'" ]
punctuation = ".,:;!?"

def remove_punctuation(input_string):
    for item in punctuation:
        input_string = input_string.replace(item, '')
    return input_string

line = raw_input("Type in lines, finish with a . at start of line only:")

while line != ".":
    newline = remove_punctuation(line)
    newline = newline.lower()
    line = raw_input()  # read the next line; without this the loop never ends
shad0w_wa1k3r
0

I found something interesting in another post that can boost your code's performance a lot. Try using a set, as described in the linked question "Faster way to remove stop words in Python".

Credit goes to alko
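
A minimal sketch of the set idea, assuming the text has already been lowercased and stripped of punctuation. The name remove_stop_words and the shortened stop word list are illustrative only and are not taken from the linked post.

```python
# Hypothetical, shortened stop word list. A set gives constant-time
# membership tests on average, where a list is scanned element by element.
STOP_WORDS = {"a", "i", "it", "am", "the", "is", "my", "and", "to"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return text with every whitespace-separated stop word dropped."""
    kept = [word for word in text.split() if word not in stop_words]
    return " ".join(kept)
```

For example, remove_stop_words("i am going to the market") returns "going market". The lookup cost stays roughly flat as the stop word list grows, which is where the gain over a list comes from.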

eSadr
  • Welcome to Stack Overflow! Please quote the most relevant part of the link, in case the target site is unreachable or goes permanently offline. See [How do I write a good answer](http://stackoverflow.com/help/how-to-answer). – ByteHamster Feb 27 '15 at 19:38
0

You can use the NLTK library instead of defining the stop words yourself.

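# Install first (shell command): pip install nltk
# One-time corpus download: import nltk; nltk.download('stopwords')
import re
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
sw = "This is an example sentence"  # example input string
for word in stop_words:
    # match the word only when it stands alone between whitespace;
    # re.escape protects against regex metacharacters in the word
    sw = re.sub(r'(?<!\S)' + re.escape(word) + r'(?!\S)', "", sw, flags=re.IGNORECASE)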

Performance can also be improved by building a single compiled regex for all stop words and applying it once.
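
A rough sketch of that single-regex idea, using only the standard re module. The short STOP_WORDS list and the name strip_stop_words are placeholders for illustration; with NLTK the list could come from stopwords.words('english') instead.

```python
import re

# Hypothetical short list, used here so the sketch runs without NLTK.
STOP_WORDS = ["a", "i", "it", "am", "the", "is", "my", "and", "to"]

# One alternation of all stop words, compiled once. The lookarounds
# require whitespace (or the string edge) on both sides of the word.
STOP_WORD_RE = re.compile(
    r"(?<!\S)(?:" + "|".join(re.escape(w) for w in STOP_WORDS) + r")(?!\S)",
    flags=re.IGNORECASE,
)


def strip_stop_words(text):
    """Remove all stop words in one pass, then tidy the leftover spaces."""
    without = STOP_WORD_RE.sub("", text)
    return " ".join(without.split())
```

For example, strip_stop_words("It is my book and the pen") returns "book pen". The text is scanned once instead of once per stop word.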

dgunseli