Questions tagged [nltk-trainer]

55 questions
21
votes
2 answers

What is the preferred ratio between the vocabulary size and embedding dimension?

When using, for example, gensim's word2vec or a similar method to train your embedding vectors, I was wondering: is there a good or preferred ratio between the embedding dimension and the vocabulary size? Also, how does that change with more…
Gabriel Bercea
  • 1,191
  • 1
  • 10
  • 21
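There is no single correct answer, but a commonly cited rule of thumb (not a hard rule, and worth validating against your own task) is to start with an embedding dimension near the fourth root of the vocabulary size. A minimal sketch of that heuristic; the function name is illustrative, not from any library:

```python
# Rule-of-thumb starting point, not a hard rule: embedding dimension
# roughly equal to the fourth root of the vocabulary size.
def suggested_embedding_dim(vocab_size: int) -> int:
    """Return a rough starting embedding dimension for a vocabulary."""
    return max(1, round(vocab_size ** 0.25))

print(suggested_embedding_dim(10_000))     # 10k types -> small dimension
print(suggested_embedding_dim(1_000_000))  # 1M types -> larger dimension
```

From there, treat the dimension as a hyperparameter and tune it on downstream performance.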
7
votes
1 answer

How to handle words that contain a space between characters?

I am using nltk.word_tokenize for the Dari language. The problem is that some single words contain a space. For example, the word "زنده گی" means life, and there are many other words like it. For all words that end with the character "ه" we have to give a…
The Afghan
  • 99
  • 1
  • 7
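One common approach is to post-process the token stream and re-merge token pairs that are known to form one word; NLTK's `nltk.tokenize.MWETokenizer` does essentially this for multi-word expressions. A dependency-free sketch, where `MULTI_TOKEN_WORDS` is a hypothetical lexicon you would fill with the words described above:

```python
# Hypothetical lexicon of Dari words that nltk.word_tokenize splits
# on the internal space; extend with the other words mentioned.
MULTI_TOKEN_WORDS = {("زنده", "گی")}

def merge_split_words(tokens):
    """Re-merge adjacent tokens that belong to one word."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in MULTI_TOKEN_WORDS:
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_split_words(["زنده", "گی", "خوب"]))
```

With `MWETokenizer` the equivalent would be to pass the same pairs at construction time and choose a separator.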
6
votes
0 answers

Error when installing nltk packages on heroku

I am trying to install nltk packages on Heroku using an nltk.txt file. In my nltk.txt file only punkt is written; in the requirements.txt file, nltk is listed. But when I push, it shows errors. Please help me fix my problem. remote: -----> Python app…
Tulshi Das
  • 480
  • 3
  • 18
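For reference, Heroku's Python buildpack (assuming its NLTK support, which downloads resources listed in an nltk.txt at the repository root after installing requirements.txt) expects roughly this layout; stray whitespace, a BOM, or placing the files outside the project root are frequent causes of failures:

```
# requirements.txt
nltk

# nltk.txt  (one NLTK resource id per line, no extra whitespace)
punkt
```

If the push still fails, the lines following `remote: -----> Python app` in the build log usually name the exact resource that could not be resolved.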
6
votes
1 answer

NLTK - Download all nltk data except corpora from command line without Downloader UI

We can download all nltk data using import nltk; nltk.download('all'), or specific data using nltk.download('punkt') and nltk.download('maxent_treebank_pos_tagger'). But I want to download all data except the 'corpora' files, for example - all…
RAVI
  • 3,143
  • 4
  • 25
  • 38
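One possible approach (a sketch, assuming my recollection of the `nltk.downloader` API is right: `Downloader().packages()` yields package objects with `id` and `subdir` attributes, where corpora live under the `corpora` subdir) is to filter the package list yourself. The download step is wrapped in a function and not called here, so the sketch has no side effects:

```python
from collections import namedtuple

# Stand-in for the metadata objects nltk.downloader.Downloader().packages()
# returns; each real package has an `id` and a `subdir` such as
# 'corpora', 'tokenizers', or 'taggers'.
Pkg = namedtuple("Pkg", ["id", "subdir"])

def non_corpora_ids(packages):
    """Keep only package ids whose subdir is not 'corpora'."""
    return [p.id for p in packages if p.subdir != "corpora"]

def download_all_except_corpora():
    # Requires nltk and network access; deliberately not invoked here.
    from nltk.downloader import Downloader
    d = Downloader()
    for pkg_id in non_corpora_ids(d.packages()):
        d.download(pkg_id)

print(non_corpora_ids([Pkg("punkt", "tokenizers"), Pkg("brown", "corpora")]))
```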
5
votes
1 answer

Laplace smoothing function in nltk

I'm building a text generation model using nltk.lm.MLE, and I notice nltk also has nltk.lm.Laplace, which I can use to smooth the data and avoid division by zero; the documentation is https://www.nltk.org/api/nltk.lm.html. However, there's no clear…
MeiNan Zhu
  • 1,021
  • 1
  • 9
  • 18
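What `nltk.lm.Laplace` computes is plain add-one smoothing: P(w|h) = (c(h,w) + 1) / (c(h) + V), where V is the vocabulary size. A dependency-free sketch of the formula on a toy bigram model (the function name is illustrative):

```python
from collections import Counter

def laplace_bigram_prob(bigram_counts, unigram_counts, vocab_size, context, word):
    """Add-one estimate: P(word|context) = (c(context,word)+1) / (c(context)+V)."""
    return (bigram_counts[(context, word)] + 1) / (unigram_counts[context] + vocab_size)

tokens = ["a", "b", "a", "b", "c"]
bigrams = Counter(zip(tokens, tokens[1:]))   # (a,b):2, (b,a):1, (b,c):1
unigrams = Counter(tokens)
V = len(unigrams)

p = laplace_bigram_prob(bigrams, unigrams, V, "a", "b")
print(p)  # (2+1)/(2+3) = 0.6
```

In nltk.lm itself the usual recipe is `Laplace(order)` in place of `MLE(order)`, fit on the same `padded_everygram_pipeline` output; only the smoothing of `score()` changes.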
5
votes
2 answers

How to train NLTK PunktSentenceTokenizer batchwise?

I am trying to split financial documents into sentences. I have ~50,000 documents containing plain English text; the total file size is ~2.6 GB. I am using NLTK's PunktSentenceTokenizer with the standard English pickle file, and I additionally tweaked it…
JumpinMD
  • 53
  • 6
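A sketch of incremental Punkt training, assuming NLTK is installed: `PunktTrainer.train` accepts `finalize=False`, so documents can be streamed one at a time instead of concatenated into one giant string, with a single `finalize_training()` at the end:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Toy stand-ins for the ~50,000 financial documents.
documents = [
    "Revenue grew strongly. The outlook remains stable.",
    "Costs fell sharply. Margins improved as a result.",
]

trainer = PunktTrainer()
for doc in documents:                 # stream documents batchwise
    trainer.train(doc, finalize=False)
trainer.finalize_training()

tokenizer = PunktSentenceTokenizer(trainer.get_params())
sentences = tokenizer.tokenize("This is one sentence. This is another.")
print(sentences)
```

This keeps memory bounded by the largest single document rather than the whole 2.6 GB corpus.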
5
votes
1 answer

Python NLTK visualization

I am currently doing natural language processing using Python NLTK. I want to generate some beautiful graphics of the representation of the input. What package can I use to get something like this?
wrek
  • 1,061
  • 5
  • 14
  • 26
3
votes
1 answer

No module named 'nltk.lm' in Google colaboratory

I'm trying to import the NLTK language modeling module (nltk.lm) in a Google Colaboratory notebook without success. I've tried installing everything from nltk, still without success. What mistake or omission could I be making? Thanks in…
Ramiro Hum-Sah
  • 132
  • 1
  • 6
3
votes
1 answer

Is it possible to modify and run only part of a Python program without having to run all of it again and again?

I have written Python code to train a Brill tagger from the NLTK library on some 8,000 English sentences and tag some 2,000 sentences. The Brill tagger takes many, many hours to train, and when it finally finished training, the last statement of the…
singhuist
  • 302
  • 1
  • 6
  • 17
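The usual answer to this is to cache the expensive step: serialize the trained tagger with `pickle` once, then load it on later runs instead of retraining. A minimal sketch, where `CACHE` and `expensive_training` are hypothetical stand-ins for the tagger's pickle file and the hours-long training call:

```python
import os
import pickle

CACHE = "tagger.pickle"  # hypothetical cache file name

def expensive_training():
    # Stands in for e.g. training a Brill tagger for hours.
    return {"model": "trained"}

if os.path.exists(CACHE):
    with open(CACHE, "rb") as f:
        model = pickle.load(f)       # fast path on re-runs
else:
    model = expensive_training()     # slow path, runs once
    with open(CACHE, "wb") as f:
        pickle.dump(model, f)

print(model)
```

Alternatively, an interactive session (IPython/Jupyter) lets you keep the trained object in memory while editing and re-running only the tagging code.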
2
votes
0 answers

NLTK: How to define the "labeled_featuresets" when creating a ClassifierBasedTagger with nltk?

I am playing around with nltk right now. I am trying to create various classifiers with nltk to do named entity recognition and compare their results. Creating n-gram taggers was easy; however, I have run into some issues creating a…
2
votes
2 answers

nltk.org example of sentence segmentation with Naive Bayes classifier: how does .sents separate sentences, and how does the ML algorithm improve on it?

There is an example in the nltk.org book (chapter 6) where they use a Naive Bayes algorithm to classify a punctuation symbol as finishing a sentence or not finishing one… This is what they do: first they take a corpus and use the .sents method to…
Martin
  • 414
  • 7
  • 21
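For context: corpus readers expose `.sents()`, which returns the gold-standard sentence boundaries already annotated in the corpus, so the classifier learns from those labels rather than discovering boundaries itself. The feature extractor in that chapter is, roughly, the following pure-Python function (reproduced from memory of the book, so treat the exact feature set as approximate):

```python
def punct_features(tokens, i):
    """Features describing the context of the punctuation token at index i."""
    return {
        "next-word-capitalized": tokens[i + 1][0].isupper(),
        "prev-word": tokens[i - 1].lower(),
        "punct": tokens[i],
        "prev-word-is-one-char": len(tokens[i - 1]) == 1,
    }

tokens = ["Mr", ".", "Smith", "went", "home", "."]
features = punct_features(tokens, 1)
print(features)
```

The classifier then improves on naive splitting by learning, e.g., that a period after a one-character word is often an abbreviation, not a sentence end.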
2
votes
2 answers

Finding matching words with ngrams

Dataset:
df['bigram'] = df['Clean_Data'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df[:,0:1]
Id        bigram
1952043   [(Swimming, Pool), (Pool, in), (in, the), (the, roof), (roof, top),
1918916   …
Rajitha Naik
  • 103
  • 2
  • 11
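For what the `ngrams(..., 2)` call in that snippet produces, a dependency-free equivalent of `nltk.util.ngrams(tokens, 2)` is just a pairwise zip, which also makes it easy to search the resulting bigrams:

```python
def bigrams(tokens):
    """Pairwise bigrams, equivalent to list(nltk.util.ngrams(tokens, 2))."""
    return list(zip(tokens, tokens[1:]))

tokens = ["Swimming", "Pool", "in", "the", "roof", "top"]
pairs = bigrams(tokens)
print(pairs)

# Finding bigrams that contain a given word:
matches = [p for p in pairs if "Pool" in p]
print(matches)
```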
2
votes
1 answer

Python 2.x - How to get the result of the NLTK Naive Bayes classification through a trainSet and a testSet

I'm building a text parser to identify the types of crime contained in texts. My class was built to load the texts from 2 CSV files (one file to train and one file to test). The way it was built, the methods in my class are for making a rapid…
Leandro Santos
  • 67
  • 1
  • 1
  • 10
2
votes
3 answers

How to add a custom corpus to the local machine in nltk

I have a custom corpus created with data on which I need to do some classification. I have the dataset in the same format as the movie_reviews corpus. According to the nltk documentation, I use the following code to access the movie_reviews corpus.…
Janitha
  • 65
  • 9
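A sketch of the usual answer, assuming NLTK is installed: point a `CategorizedPlaintextCorpusReader` at your own movie_reviews-style directory layout (`root/pos/*.txt`, `root/neg/*.txt`), so the custom data behaves like the built-in corpus. The temporary directory and file contents below are illustrative:

```python
import os
import tempfile
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Build a tiny movie_reviews-style layout for illustration.
root = tempfile.mkdtemp()
for cat, text in [("pos", "great movie"), ("neg", "bad movie")]:
    os.makedirs(os.path.join(root, cat), exist_ok=True)
    with open(os.path.join(root, cat, "1.txt"), "w") as f:
        f.write(text)

reader = CategorizedPlaintextCorpusReader(
    root,
    r"(pos|neg)/.*\.txt",          # which files belong to the corpus
    cat_pattern=r"(pos|neg)/.*",   # derive the category from the path
)
print(reader.categories())
print(list(reader.words(categories="pos")))
```

From here, the same fileid/category-based feature extraction shown in the nltk documentation for movie_reviews applies unchanged.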
2
votes
1 answer

How to remove nltk from Python, from my system, and from the command prompt

I tried downloading nltk data by using these commands at the Python prompt: import nltk; nltk.download() // after this it started downloading. Now I want to delete all the nltk files from my system; please help with uninstalling and removing all the…