Questions tagged [nltk]

The Natural Language Toolkit (NLTK) is a Python library for computational linguistics. It is currently available for Python 2.7 and 3.2+.

NLTK includes a great number of common natural language processing tools, including a tokenizer, a chunker, a part-of-speech (POS) tagger, a stemmer, a lemmatizer, and various classifiers such as Naive Bayes and decision trees. In addition to these tools, NLTK ships many common corpora, including the Brown Corpus, Reuters, and WordNet. The corpora collection also includes a few non-English corpora in Portuguese, Polish, and Spanish.

The book Natural Language Processing with Python - Analyzing Text with the Natural Language Toolkit by Steven Bird, Ewan Klein, and Edward Loper is freely available online under the Creative Commons Attribution Noncommercial No Derivative Works 3.0 US license. A citable paper, NLTK: The Natural Language Toolkit, was published in 2003 and again in 2006, allowing researchers to acknowledge NLTK's contribution in ongoing computational linguistics research.

NLTK is currently distributed under the Apache License, version 2.0.

7139 questions
351 votes · 7 answers

What is "entropy and information gain"?

I am reading this book (NLTK) and it is confusing. Entropy is defined as: Entropy is the sum of the probability of each label times the log probability of that same label How can I apply entropy and maximum entropy in terms of text mining? Can…
TIMEX
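The definition quoted above translates directly into a few lines of Python; a minimal sketch (the function name and sample labels are my own, not from the book):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a label distribution: -sum over labels of p * log2(p)."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy(["spam", "spam", "ham", "ham"]))  # two equally likely labels -> 1.0
print(entropy(["spam", "spam", "spam", "spam"]))  # a single label -> 0.0
```

A uniform label distribution maximizes entropy, while a single repeated label gives 0; that is why entropy works as an impurity measure when choosing decision-tree splits.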
202 votes · 14 answers

What is the difference between lemmatization and stemming?

When do I use each? Also, is the NLTK lemmatization dependent upon part of speech? Wouldn't it be more accurate if it was?
TIMEX
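A toy contrast (these are illustrative stand-ins, not NLTK's actual algorithms): a stemmer chops suffixes by rule, while a lemmatizer maps words to dictionary forms, which is why a lemmatizer benefits from knowing the part of speech.

```python
# Toy illustration only: a crude rule-based stemmer vs. a lookup-based lemmatizer.
SUFFIXES = ("ing", "ly", "ed", "s")                # assumption: tiny rule set
LEMMAS = {"better": "good", "was": "be"}           # assumption: toy lookup table

def toy_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]            # chop blindly, meaning ignored
    return word

def toy_lemmatize(word):
    return LEMMAS.get(word, word)                  # dictionary lookup

print(toy_stem("meeting"))      # 'meet' -- even when "meeting" is a noun
print(toy_lemmatize("better"))  # 'good' -- a stemmer could never produce this
```

In NLTK itself, nltk.stem.PorterStemmer and nltk.stem.WordNetLemmatizer play these roles; WordNetLemmatizer.lemmatize() defaults to treating words as nouns, so passing the right pos argument generally does improve accuracy.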
188 votes · 9 answers

What are all possible POS tags of NLTK?

How do I find a list with all possible POS tags used by the Natural Language Toolkit (NLTK)?
OrangeTux
187 votes · 18 answers

Failed loading english.pickle with nltk.data.load

When trying to load the punkt tokenizer... import nltk.data tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle') ...a LookupError was raised: > LookupError: > ********************************************************************* …
Martin
187 votes · 12 answers

How to check if a word is an English word with Python?

I want to check in a Python program if a word is in the English dictionary. I believe the nltk wordnet interface might be the way to go but I have no clue how to use it for such a simple task. def is_english_word(word): pass # how do I implement…
Barthelemy
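One common approach is a set-membership test against a word list; a sketch with a tiny stand-in set (in practice the set could be built from nltk.corpus.words or WordNet, both of which require a corpus download first):

```python
# Stand-in word set; in practice, load e.g. set(nltk.corpus.words.words()).
ENGLISH_WORDS = {"cat", "dog", "house", "run", "python"}

def is_english_word(word):
    # Lowercase so capitalized forms like "Cat" still match the list.
    return word.lower() in ENGLISH_WORDS

print(is_english_word("Python"))      # True
print(is_english_word("qwertyuiop")) # False
```

Note that a WordNet-based check (wordnet.synsets(word)) misses function words such as "the", since WordNet only covers open-class words, so a plain word list is often the better fit for this task.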
174 votes · 17 answers

n-grams in python, four, five, six grams?

I'm looking for a way to split a text into n-grams. Normally I would do something like: import nltk from nltk import bigrams string = "I really like python, it's pretty awesome." string_bigrams = bigrams(string) print string_bigrams I am aware that…
Shifu
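A sketch generalizing bigrams to arbitrary n using zip over shifted token lists (note that passing a raw string, as in the question's code, makes NLTK iterate over characters; tokenize into words first):

```python
def ngrams(text, n):
    """Word n-grams via zip over n shifted copies of the token list."""
    tokens = text.split()  # crude tokenization; nltk.word_tokenize is an option
    return list(zip(*(tokens[i:] for i in range(n))))

print(ngrams("I really like python", 3))
# [('I', 'really', 'like'), ('really', 'like', 'python')]
```

NLTK also ships nltk.ngrams(tokens, n) for exactly this.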
162 votes · 12 answers

How to get rid of punctuation using NLTK tokenizer?

I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also…
lizarisk
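word_tokenize deliberately keeps punctuation as tokens; matching word characters instead sidesteps the problem. A sketch (the regex here is a simplification):

```python
import re

def words_only(text):
    # Keep runs of letters (and internal apostrophes); punctuation never matches.
    return re.findall(r"[A-Za-z']+", text)

print(words_only("Hello, world! It's me."))  # ['Hello', 'world', "It's", 'me']
```

NLTK's RegexpTokenizer(r"\w+") does the same with a tokenizer interface; alternatively, filter the output of word_tokenize with str.isalpha().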
139 votes · 13 answers

How to remove stop words using nltk or python

I have a dataset from which I would like to remove stop words. I used NLTK to get a list of stop words: from nltk.corpus import stopwords stopwords.words('english') Exactly how do I compare the data to the list of stop words, and thus remove the…
Alex
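The comparison is just a membership test per token; a sketch with a stand-in stop-word set (in practice, build it with set(stopwords.words('english'))):

```python
# Stand-in for set(nltk.corpus.stopwords.words('english')).
STOPWORDS = {"the", "a", "is", "of", "and"}

def remove_stopwords(tokens):
    # Lowercase each token for comparison, since the NLTK list is lowercase.
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "cat", "is", "black"]))  # ['cat', 'black']
```

Converting the stop list to a set matters on large datasets: list membership is O(n) per token, set membership is O(1).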
134 votes · 10 answers

How to check which version of nltk and scikit-learn is installed?

In shell script I am checking whether these packages are installed or not, if not installed then install them. So within shell script: import nltk echo nltk.__version__ but it stops shell script at import line in linux terminal tried to see in this…
nlper
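The snippet in the question mixes shell and Python syntax: import is a Python statement, not a shell command, so the shell script fails at that line. A sketch of doing the check from Python itself (the helper name is my own):

```python
import importlib

def package_version(name):
    """Return a package's __version__ attribute, or None if unavailable."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)

print(package_version("nltk"))  # version string if installed, otherwise None
```

From a shell script, keep the Python in a one-liner instead, e.g. python -c "import nltk; print(nltk.__version__)".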
131 votes · 28 answers

pip issue installing almost any library

I have a difficult time using pip to install almost anything. I'm new to coding, so I thought maybe this is something I've been doing wrong and have opted out to easy_install to get most of what I needed done, which has generally worked. However,…
contentclown
126 votes · 5 answers

re.sub fails with "expected string or bytes-like object"

I have read multiple posts regarding this error, but I still can't figure it out. When I try to loop through my function: def fix_Plan(location): letters_only = re.sub("[^a-zA-Z]", # Search for all non-letters " ", …
imanexcelnoob
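This error usually means a non-string value, typically a NaN float from a pandas column, reached re.sub; guarding on type fixes it. A sketch (fix_plan mirrors the question's function, the guard is the addition):

```python
import re

def fix_plan(location):
    # re.sub raises TypeError on non-strings, e.g. NaN floats from pandas;
    # skip (or coerce with str()) anything that is not already a str.
    if not isinstance(location, str):
        return ""
    return re.sub("[^a-zA-Z]", " ", location)  # replace all non-letters

print(fix_plan("a1b"))         # 'a b'
print(fix_plan(float("nan")))  # ''
```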
123 votes · 19 answers

Resource u'tokenizers/punkt/english.pickle' not found

My Code: import nltk.data tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle') ERROR Message: [ec2-user@ip-172-31-31-31 sentiment]$ python mapper_local_v1.0.py Traceback (most recent call last): File "mapper_local_v1.0.py", line 16,…
Supreeth Meka
112 votes · 6 answers

Python: tf-idf-cosine: to find document similarity

I was following a tutorial which was available at Part 1 & Part 2. Unfortunately the author didn't have the time for the final section which involved using cosine similarity to actually find the distance between two documents. I followed the…
add-semi-colons
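The missing final step is a dot product over shared terms divided by the two vector norms; a sketch over raw term frequencies (real tf-idf would weight the counts by inverse document frequency first):

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity of two documents over raw term-frequency vectors."""
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())  # shared terms only
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("the cat sat", "the cat ran"))
```

Identical documents score 1.0 and documents sharing no terms score 0.0, so the value can be read directly as a similarity rather than a distance.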
108 votes · 7 answers

NLTK python error: "TypeError: 'dict_keys' object is not subscriptable"

I'm following instructions for a class homework assignment and I'm supposed to look up the top 200 most frequently used words in a text file. Here's the last part of the code: fdist1 = FreqDist(NSmyText) vocab=fdist1.keys() vocab[:200] But when I…
user3760644
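In Python 3, dict.keys() returns a view object that cannot be sliced, which is what the book's Python 2 code assumes; wrapping it in list(), or using most_common(), fixes it. A sketch with Counter standing in for FreqDist (in NLTK 3, FreqDist subclasses Counter):

```python
from collections import Counter

fdist1 = Counter("the cat sat on the mat".split())  # stand-in for FreqDist(text)
vocab = list(fdist1.keys())   # list() makes the keys sliceable again
print(vocab[:2])
print([w for w, _ in fdist1.most_common(1)])  # ['the'] sorted by frequency
```

For "top 200 most frequent words", fdist1.most_common(200) is the idiomatic call, since plain keys() order is not sorted by frequency.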
100 votes · 7 answers

How to config nltk data directory from code?

How to config nltk data directory from code?
Juanjo Conti