Questions tagged [nltk-trainer]
55 questions
21
votes
2 answers
What is the preferred ratio between the vocabulary size and embedding dimension?
When using, for example, gensim word2vec or a similar method to train your embedding vectors, I was wondering: is there a good or preferred ratio between the embedding dimension and the vocabulary size?
Also, how does that change with more…

Gabriel Bercea
- 1,191
- 1
- 10
- 21
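For the embedding-dimension question above, a minimal gensim sketch (assuming gensim 4.x, where the parameter is called vector_size; older releases call it size). The fourth-root heuristic is only a commonly cited rule of thumb, not a gensim requirement:

from gensim.models import Word2Vec

sentences = [['the', 'quick', 'brown', 'fox'], ['a', 'lazy', 'dog']]   # toy corpus
model = Word2Vec(sentences, vector_size=100, min_count=1)
vocab_size = len(model.wv.key_to_index)        # size of the learned vocabulary
heuristic_dim = round(vocab_size ** 0.25)      # rule-of-thumb starting point; tune empirically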
7
votes
1 answer
How to handle words which have a space between characters?
I am using nltk.word_tokenize for the Dari language. The problem is that we have a space inside a single word.
For example, the word "زنده گی" means life, and there are many other words like it. For all words which end with the character "ه" we have to give a…

The Afghan
- 99
- 1
- 7
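One possible workaround for the question above is NLTK's MWETokenizer, which re-merges multi-token expressions after word_tokenize has split them; this is only a sketch, not a full solution for Dari morphology:

from nltk.tokenize import MWETokenizer, word_tokenize

mwe = MWETokenizer([('زنده', 'گی')], separator=' ')   # list each space-containing word once
tokens = mwe.tokenize(word_tokenize('زنده گی'))
print(tokens)   # the two pieces come back as the single token 'زنده گی'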
6
votes
0 answers
Error when installing nltk packages on Heroku
I am trying to install nltk packages on Heroku using an nltk.txt file. In my nltk.txt file only punkt is listed; in the requirements.txt file nltk is listed.
But when I push, it shows the errors below.
Please help me fix my problem.
remote: -----> Python app…

Tulshi Das
- 480
- 3
- 18
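Without the full build log it is hard to say what fails, but for the setup described above the Heroku Python buildpack expects the two files to look roughly like this (a sketch, not the asker's actual files):

requirements.txt (installs the library):
nltk

nltk.txt (one NLTK data package id per line, downloaded at build time):
punkt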
6
votes
1 answer
NLTK - Download all nltk data except corpora from the command line without the Downloader UI
We can download all nltk data using:
> import nltk
> nltk.download('all')
Or specific data using:
> nltk.download('punkt')
> nltk.download('maxent_treebank_pos_tagger')
But I want to download all data except the 'corpora' files,
for example, all…

RAVI
- 3,143
- 4
- 25
- 38
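A sketch of one way to do this programmatically rather than from the Downloader UI, assuming nltk.downloader.Downloader exposes a packages() iterator whose items carry id and subdir attributes (subdir being 'corpora', 'taggers', 'tokenizers', and so on):

import nltk
from nltk.downloader import Downloader

for pkg in Downloader().packages():   # iterate the remote package index
    if pkg.subdir != 'corpora':       # skip everything filed under corpora
        nltk.download(pkg.id)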
5
votes
1 answer
Laplace smoothing function in nltk
I'm building a text generation model using nltk.lm.MLE. I notice they also have nltk.lm.Laplace that I can use to smooth the data and avoid division by zero; the documentation is https://www.nltk.org/api/nltk.lm.html. However, there's no clear…

MeiNan Zhu
- 1,021
- 1
- 9
- 18
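A minimal sketch of training an add-one (Laplace) smoothed model with nltk.lm, following the same preprocessing pipeline the MLE examples use:

from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

text = [['a', 'b', 'c'], ['a', 'c', 'd', 'c', 'e', 'f']]   # toy tokenized corpus
n = 2
train, vocab = padded_everygram_pipeline(n, text)
lm = Laplace(n)                  # add-one smoothing instead of MLE
lm.fit(train, vocab)
print(lm.score('b', ['a']))      # smoothed P('b' | 'a'), never zero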
5
votes
2 answers
How to train NLTK PunktSentenceTokenizer batchwise?
I am trying to split financial documents into sentences. I have ~50,000 documents containing plain English text. The total file size is ~2.6 GB.
I am using NLTK's PunktSentenceTokenizer with the standard English pickle file. I additionally tweaked it…

JumpinMD
- 53
- 6
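The usual batch-wise recipe is to feed documents to a PunktTrainer one at a time and only finalize at the end; a sketch, where iter_documents() is a hypothetical generator over the ~50,000 files:

from nltk.tokenize.punkt import PunktTrainer, PunktSentenceTokenizer

trainer = PunktTrainer()
for text in iter_documents():              # hypothetical: yields each document as a string
    trainer.train(text, finalize=False)    # accumulate statistics batch by batch
trainer.finalize_training()
tokenizer = PunktSentenceTokenizer(trainer.get_params())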
5
votes
1 answer
Python NLTK visualization
I am currently doing natural language processing using Python NLTK. I want to generate some beautiful graphics representing the input. What package can I use to get something like this?

wrek
- 1,061
- 5
- 14
- 26
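NLTK itself ships a few matplotlib-backed plots that may be what the question is after; a sketch with a stand-in input string:

import nltk
from nltk import FreqDist, word_tokenize

raw_text = 'the market price rose and the market price fell'   # stand-in for the real input
tokens = word_tokenize(raw_text)
FreqDist(tokens).plot(10)                                # frequency plot of the top tokens
nltk.Text(tokens).dispersion_plot(['market', 'price'])   # lexical dispersion plot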
3
votes
1 answer
No module named 'nltk.lm' in Google colaboratory
I'm trying to import the NLTK language modeling module (nltk.lm) in a Google Colaboratory notebook, without success. I've tried installing everything from nltk, still without success.
What mistake or omission could I be making?
Thanks in…

Ramiro Hum-Sah
- 132
- 1
- 6
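nltk.lm was added in NLTK 3.4, so the usual cause is an older preinstalled nltk in the Colab image; a sketch of the fix, run in a notebook cell:

!pip install -U nltk
# restart the runtime so the upgraded package is picked up, then:
from nltk.lm import MLE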
3
votes
1 answer
Is it possible to modify and run only part of a Python program without having to run all of it again and again?
I have written Python code to train the Brill tagger from the NLTK library on some 8,000 English sentences and tag some 2,000 sentences.
The Brill tagger takes many, many hours to train, and finally, when it finished training, the last statement of the…

singhuist
- 302
- 1
- 6
- 17
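The standard answer is to persist the trained tagger with pickle so later runs can load it instead of retraining; a sketch, where brill_tagger stands for the tagger the question trains:

import pickle

with open('brill_tagger.pickle', 'wb') as f:
    pickle.dump(brill_tagger, f)           # do this once, right after training finishes

with open('brill_tagger.pickle', 'rb') as f:
    brill_tagger = pickle.load(f)          # later runs: load in seconds, then tag the 2,000 sentences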
2
votes
0 answers
NLTK: How to define the "labeled_featuresets" when creating a ClassifierBasedTagger with nltk?
I am playing around with nltk right now. I am trying to create various classifiers with nltk for named entity recognition, to compare their results. Creating n-gram taggers was easy; however, I have run into some issues creating a…

Malonga
- 35
- 5
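ClassifierBasedTagger builds the labeled featuresets itself from a list of tagged sentences plus a feature_detector(tokens, index, history) function, so they normally aren't constructed by hand; a sketch with a deliberately simple, hypothetical feature detector:

from nltk.corpus import treebank                      # needs the 'treebank' data package
from nltk.tag.sequential import ClassifierBasedTagger

def features(tokens, index, history):
    word = tokens[index]
    return {'word': word.lower(),
            'suffix3': word[-3:],
            'prev_tag': history[-1] if history else '<START>'}

train_sents = treebank.tagged_sents()[:3000]
tagger = ClassifierBasedTagger(feature_detector=features, train=train_sents)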
2
votes
2 answers
nltk.org example of Sentence segmentation with Naive Bayes Classifier: how does .sent separate sentences and how does the ML algorithm improve it?
There is an example in the nltk.org book (chapter 6) where they use a Naive Bayes algorithm to classify a punctuation symbol as finishing a sentence or not finishing one...
This is what they do: first they take a corpus and use the .sent method to…

Martin
- 414
- 7
- 21
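For reference, the corpus used in that chapter is already split into sentences; .sents() (the method the question calls .sent) simply returns that segmentation, and the classifier is then trained to recover the recorded boundary positions. A close paraphrase of the book's setup:

import nltk

sents = nltk.corpus.treebank_raw.sents()     # corpus pre-segmented into sentences
tokens, boundaries, offset = [], set(), 0
for sent in sents:
    tokens.extend(sent)                      # flatten into one token stream
    offset += len(sent)
    boundaries.add(offset - 1)               # remember the index of each sentence-final token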
2
votes
2 answers
Finding matching words with ngrams
Dataset:
df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df[['Id', 'bigram']]
Id       bigram
1952043  [(Swimming, Pool), (Pool, in), (in, the), (the, roof), (roof, top),
1918916  …

Rajitha Naik
- 103
- 2
- 11
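Reading "finding matching words" as filtering the bigrams against a keyword list, a self-contained sketch built from the sample row shown above (the keyword set is hypothetical):

import pandas as pd
from nltk import ngrams, word_tokenize

df = pd.DataFrame({'Id': [1952043], 'Clean_Data': ['Swimming Pool in the roof top']})
df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))

keywords = {'Pool', 'roof'}                       # hypothetical words to match
df['matches'] = df['bigram'].apply(
    lambda bgs: [bg for bg in bgs if any(w in keywords for w in bg)])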
2
votes
1 answer
Python 2.x - How to get the result of the NLTK Naive Bayes classification through a trainSet and a testSet
I'm building a text parser to identify the types of crime contained in the texts. My class was built to load the texts from 2 CSV files (one file to train and one file to test). The way it was built, the methods in my class are meant to make a rapid…

Leandro Santos
- 67
- 1
- 1
- 10
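Assuming trainSet and testSet are lists of (feature_dict, label) pairs built from the two CSV files, the NLTK side of it is short; a sketch:

import nltk

classifier = nltk.NaiveBayesClassifier.train(trainSet)
print(nltk.classify.accuracy(classifier, testSet))    # overall accuracy on the test file
classifier.show_most_informative_features(10)
for features, label in testSet[:5]:
    print(label, classifier.classify(features))       # gold label vs. predicted crime type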
2
votes
3 answers
How to add a custom corpus to the local machine in nltk
I have a custom corpus created from data on which I need to do some classification. I have the dataset in the same format as the movie_reviews corpus. According to the nltk documentation, I use the following code to access the movie_reviews corpus.…

Janitha
- 65
- 9
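If the data is laid out like movie_reviews (one subdirectory per category), a CategorizedPlaintextCorpusReader pointed at the local folder is usually enough; a sketch with a hypothetical path:

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

reader = CategorizedPlaintextCorpusReader(
    '/path/to/my_corpus',        # hypothetical root: my_corpus/<category>/<file>.txt
    r'.*\.txt',
    cat_pattern=r'(\w+)/.*')     # the category is the subdirectory name
print(reader.categories())
print(reader.words(categories=reader.categories()[0])[:20])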
2
votes
1 answer
How to remove nltk from Python and from my system, and also from the command prompt
I tried downloading nltk data by using these commands at the Python prompt:
import nltk
nltk.download()  # after this it started downloading
Now I want to delete all the nltk files from my system. Please help with uninstalling and removing all the…

ChÃrming ßoy
- 13
- 1
- 7
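Two separate things need removing: the library itself (pip uninstall nltk) and the downloaded data. A sketch for locating the data directories before deleting them:

import nltk
print(nltk.data.path)   # the nltk_data directories that nltk.download() may have written to
# delete those folders (e.g. with shutil.rmtree), then run `pip uninstall nltk` for the library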