
I am working on a Python script in which I want to remove common English words like "the", "an", "and", "for" and many more from a string. Currently I have made a local list of all such words and I just call remove() to strip them from the string. But I want some more Pythonic way to achieve this. I have read about nltk and wordnet but am totally clueless about whether that's what I should use and how to use it.
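For reference, here is roughly what I am doing now (the word list is just a small sample of my much longer local list, and the function name is only illustrative):

wordsToIgnore = ["the", "an", "and", "for"]   # my real list is much longer

def removeCommonWords(text):
    # split into words, drop anything in the ignore list, join the rest back together
    words = text.split()
    for w in wordsToIgnore:
        while w in words:
            words.remove(w)   # the remove() call mentioned above
    return " ".join(words)

print(removeCommonWords("an elevator is made for five people and it's fast"))
# prints: elevator is made five people it's fast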

Edit

Well, I don't understand why this was marked as a duplicate. My question does not in any way imply that I already knew about stop words and just wanted to know how to use them. The question is about what I can use in my scenario, and the answer to that turned out to be stop words; but when I posted this question I didn't know anything about stop words.

Yogesh D
  • Look for "stop word removal"... and your basic approach is not that wrong... – dsign Apr 07 '14 at 06:06
  • Stop words might be useful in my scenario... I will search on that now... thanks. Got any link to a stop words tutorial? – Yogesh D Apr 07 '14 at 06:22
  • No tutorial, sorry... but what you are doing is correct. Just get a long list of stop words and then use the regular expressions module to replace stop words with empty strings (there is a sketch of this just below these comments). – dsign Apr 07 '14 at 06:24
  • OK, got it, thanks... the way it is done [here](http://stackoverflow.com/questions/19560498/faster-way-to-remove-stop-words-in-python) by Alfe. – Yogesh D Apr 07 '14 at 06:33
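A minimal sketch of the regex approach dsign describes above, assuming a small illustrative stop-word list (a real one would be much longer):

import re

stop_words = ["the", "an", "and", "for"]   # illustrative only
# one pattern that matches any stop word as a whole word, plus any whitespace after it
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, stop_words)) + r")\b\s*", re.IGNORECASE)

text = "an elevator is made for five people and it's fast"
print(pattern.sub("", text))
# prints: elevator is made five people it's fast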

3 Answers


Do this.

# english_dictionary: any iterable of the words you want to drop; source_text: your input string
vocabulary = set(english_dictionary)
unique_words = [word for word in source_text.split() if word not in vocabulary]

It is as simple and efficient as can be. If you don't need the positions of the unique words, make them a set too! The `in` operator is extremely fast on sets (and slow on lists and other containers).
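For example, with a small stop-word set standing in for `english_dictionary` (all names and words below are only illustrative):

stop_words = {"the", "an", "and", "for"}      # any collection of words to drop works
source_text = "an elevator is made for five people and it's fast"
remaining = [word for word in source_text.split() if word not in stop_words]
print(" ".join(remaining))
# prints: elevator is made five people it's fast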

  • Do I need to import any package for this to work? – Yogesh D Apr 07 '14 at 07:21
  • None. `set` is a built-in container that achieves fast searching of objects by not storing their relative positions. The second line is a plain, basic Python list comprehension. –  Apr 07 '14 at 08:57
  • What about english_dictionary? Is that supposed to be my list of ignore words? – Yogesh D Apr 07 '14 at 09:46
  • Yes, it is. It can be any collection: a list, a tuple, an open file with one word per row... –  Apr 07 '14 at 10:58
  • OK... that will be useful once I get the list of words. Currently I am using a long local list for that in my code, but I wanted something that already exists in Python, something like `stopwords` from the `nltk.corpus` module... Still, your answer can help me make it fast, as I have read on many blogs that using a set instead of a list makes execution faster. – Yogesh D Apr 07 '14 at 11:30
  • If it is a list it can be turned into a set. However, converting a whole dictionary to a set takes some time, so you may want to store the ready-made `set` in a file between runs. –  Apr 07 '14 at 16:06

This will also work:

yourString = "an elevator is made for five people and it's fast"
wordsToRemove = ["the ", "an ", "and ", "for "]

for word in wordsToRemove:
    yourString = yourString .replace(word, "")
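With the sample string above this leaves:

print(yourString)
# prints: elevator is made five people it's fast

Note that, because each entry carries a trailing space, a stop word at the very end of the string (or one followed by punctuation) would not be removed.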
Aleksandar
  • Yes, that works, but that's what I want to avoid: I don't want to keep my own local list, since it would have to contain all the common English words. If it were only 4-5 words this way would be fine. I will be doing it using nltk.corpus and the stop words list that it offers. – Yogesh D Apr 07 '14 at 07:20

I have found that what I was looking for is this:

from nltk.corpus import stopwords
my_stop_words = stopwords.words('english')

Now I can remove or replace words in my list/string wherever they match an entry in my_stop_words, which is a plain Python list.
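For example, a minimal sketch of the removal itself (the sample text is only illustrative, and set() just makes the membership test faster):

from nltk.corpus import stopwords

my_stop_words = set(stopwords.words('english'))

text = "an elevator is made for five people and it's fast"
filtered = " ".join(w for w in text.split() if w.lower() not in my_stop_words)
print(filtered)   # words such as "an", "is", "for" and "and" are dropped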

For this to work I had to install NLTK for Python and then, using its downloader, download the stopwords package.
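The download can be done from a Python shell with NLTK's downloader:

import nltk
nltk.download('stopwords')   # fetches just the stopwords corpus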

It also contains many other packages which can be used in different NLP situations, like words, brown, wordnet etc.

Yogesh D