I have the following sentence:
sentence="The other day I met with Juan and Mary"
I want to tokenize it but keep only the main words, that is: other, day, I, met, Juan, Mary. What I have done so far is tokenize it with the nltk library as follows:
tokens=nltk.word_tokenize(sentence)
Which gives me the following:
['The', 'other', 'day', 'I', 'met', 'with', 'Juan', 'and', 'Mary']
I have also tried tagging the words with nltk.pos_tag(tokens), obtaining:
[('The', 'DT'), ('other', 'JJ'), ('day', 'NN'), ('I', 'PRP'), ('met', 'VBD'), ('with', 'IN'), ('Juan', 'NNP'), ('and', 'CC'), ('Mary', 'NNP')]
With this output I could remove the unwanted words myself (here 'The', 'with' and 'and') by looking up their tags and deleting the matching tuples. However, I am wondering whether there is a more direct way to do it, or a command in nltk that does it for me.
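For reference, here is a minimal sketch of the manual filtering I have in mind. It starts from the tagged output shown above rather than calling nltk again, and the set of tags to drop (unwanted_tags) is only my own assumption for this sentence, not something nltk provides:

```python
# Tagged output copied from above (normally produced by nltk.pos_tag(tokens)).
tagged = [('The', 'DT'), ('other', 'JJ'), ('day', 'NN'), ('I', 'PRP'),
          ('met', 'VBD'), ('with', 'IN'), ('Juan', 'NNP'), ('and', 'CC'),
          ('Mary', 'NNP')]

# Assumption: determiners, prepositions and conjunctions are the
# "non-main" words for this example; adjust the set as needed.
unwanted_tags = {'DT', 'IN', 'CC'}

# Keep only the words whose tag is not in the unwanted set.
main_words = [word for word, tag in tagged if tag not in unwanted_tags]
print(main_words)  # ['other', 'day', 'I', 'met', 'Juan', 'Mary']
```

This works, but it means maintaining the list of tags by hand, which is why I am asking about a more direct option.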
Any help would be appreciated! Thank you very much.
Edit: This post is not only about removing stopwords; I would like to see the different options for doing so, as illustrated above with nltk.pos_tag(tokens).