How to exclude prepositions and conjunctions while tokenizing with nltk?

Question

I have the following sentence:

sentence="The other day I met with Juan and Mary"

I want to tokenize it but keep only the main words, that is: other, day, I, met, Juan, Mary. So far I have tokenized it with the nltk library as follows:

tokens=nltk.word_tokenize(sentence)

Which gives me the following:

['The', 'other', 'day', 'I', 'met', 'with', 'Juan', 'and', 'Mary']

I have also tried tagging the words with nltk.pos_tag(tokens), obtaining:

[('The', 'DT'), ('other', 'JJ'), ('day', 'NN'), ('I', 'PRP'), ('met', 'VBD'), ('with', 'IN'), ('Juan', 'NNP'), ('and', 'CC'), ('Mary', 'NNP')]

With these tags I could delete the unwanted words myself, simply by looking up their tags and dropping the corresponding tuples. However, I am wondering whether there is a more direct way to do it, or a command in nltk that does it for me.

Any help would be appreciated! Thank you very much.

Edit: This post is not only about eliminating stopwords; it is also about the different options one has for doing so, as illustrated above with nltk.pos_tag(tokens).
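The tag-based filtering described in the question can be sketched as follows. This is only a minimal illustration, not a built-in nltk feature: the tagged list is copied from the question rather than recomputed, and the names EXCLUDED_TAGS and drop_tags are hypothetical helpers introduced here. The choice of tags to drop (IN, CC, DT) is an assumption based on the words the asker wants removed.

```python
# Tagged tokens copied from the question (the output of nltk.pos_tag(tokens)).
tagged = [('The', 'DT'), ('other', 'JJ'), ('day', 'NN'), ('I', 'PRP'),
          ('met', 'VBD'), ('with', 'IN'), ('Juan', 'NNP'), ('and', 'CC'),
          ('Mary', 'NNP')]

# Penn Treebank tags to drop (an assumption for this example):
# IN = preposition, CC = coordinating conjunction, DT = determiner.
EXCLUDED_TAGS = {'IN', 'CC', 'DT'}


def drop_tags(tagged_tokens, excluded=EXCLUDED_TAGS):
    """Return the words whose POS tag is not in the excluded set."""
    return [word for word, tag in tagged_tokens if tag not in excluded]


main_words = drop_tags(tagged)
print(main_words)  # ['other', 'day', 'I', 'met', 'Juan', 'Mary']
```

Filtering on tags rather than on a word list keeps "other", which a stopword list would typically remove, so the two approaches are not interchangeable.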

The words you don't want are called *stopwords*. Look at this link: https://pythonspot.com/nltk-stop-words/ — BoarGules, Apr 03 '18 at 07:48

score 2 · Accepted Answer · answered Apr 03 '18 at 08:04

As @BoarGules said in a comment, it seems you want to remove stopwords from your sentence and are looking for a direct way to do that, so here is a solution.

Check this:

import nltk
from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))         #Have around 900 stopwords
nltk_words = list(stopwords.words('english'))   #Have around 150 stopwords
stop_words.extend(nltk_words)

sentence = "The other day I met with Juan and Mary"   #Your sentence
tokens = nltk.word_tokenize(sentence)
output = []

for word in tokens:
    if word not in stop_words:   # case-sensitive check, so 'The' and 'I' are kept
        output.append(word)

print(output)

This gives the following output:

['The', 'day', 'I', 'met', 'Juan', 'Mary']

Hope this helps! Thank you! :)

It helped! Do you know how I could also avoid symbols like "(", "?", other punctuation symbols and so on? — marisa, Apr 03 '18 at 08:06
@marisa: If my solution helps you, consider accepting my answer. As for your question, use a regular expression to find those symbols, then exclude/remove them. — Abdullah Ahmed Ghaznavi, Apr 03 '18 at 08:09
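One possible reading of the regular-expression suggestion in the comment above is sketched here. It is an assumption about what the commenter had in mind, not their code: the token list is written by hand for illustration, and PUNCT_ONLY and strip_punctuation are hypothetical names.

```python
import re

# Hand-written token list for illustration (not produced by running nltk here).
tokens = ['Did', 'you', 'meet', 'Juan', '(', 'and', 'Mary', ')', '?']

# Matches tokens made up entirely of characters that are neither
# word characters nor whitespace, e.g. "(", ")", "?", "...".
PUNCT_ONLY = re.compile(r'[^\w\s]+')


def strip_punctuation(token_list):
    """Return the tokens that are not pure punctuation."""
    return [tok for tok in token_list if not PUNCT_ONLY.fullmatch(tok)]


words = strip_punctuation(tokens)
print(words)  # ['Did', 'you', 'meet', 'Juan', 'and', 'Mary']
```

Because the pattern must match the whole token, tokens that merely contain punctuation, such as "n't", are kept.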
