I'm having trouble using NLTK to generate random sentences from a custom corpus.
Before I start, I should mention that I'm using NLTK 2.x, so the generate() function still exists.
Here is my current code:
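import nltk

# Read the whole play into a single string
with open('corpus/romeo and juliet.txt', 'r') as f:
    raw = f.read()
# Split the raw string into word tokens and wrap them in an nltk.Text
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
print text.generate(length=10)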
This runs, but it does not create random sentences (I'm going for a horse_ebooks vibe). Instead, it returns the first 10 words of my source text every time.
However, if I use NLTK's Brown corpus, I get the desired random effect:
text = nltk.Text(nltk.corpus.brown.words())
print text.generate(length=10)
Looking at the Brown corpus files, it seems that every word is separated and tagged with its part of speech (verb, adjective, etc.), which is something I thought the word_tokenize
function in my first block of code would handle.
Is there a way to generate a corpus like the Brown example - even if it means converting my txt documents into that fashion instead of reading them directly?
Any help would be appreciated. The documentation I can find on this is either horribly outdated or just says to use Markov chains (which I have, but I want to figure this out!). I understand generate()
was removed as of NLTK 3.0 because of bugs.
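
For context, the Markov-chain approach mentioned above can be sketched roughly as follows. This is only a minimal illustration and not what NLTK's generate() does internally: the names build_chain and generate_words are made up for this example, the sample sentence is a stand-in for the real corpus, and plain whitespace splitting is used instead of nltk.word_tokenize so the snippet has no dependencies.

```python
import random
from collections import defaultdict


def build_chain(tokens, order=2):
    """Map each run of `order` tokens to the tokens seen right after it."""
    chain = defaultdict(list)
    for i in range(len(tokens) - order):
        state = tuple(tokens[i:i + order])
        chain[state].append(tokens[i + order])
    return chain


def generate_words(chain, length=10, rng=random):
    """Random-walk the chain, starting from a randomly chosen state."""
    state = rng.choice(sorted(chain))
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:
            # Dead end (state only occurs at the end of the text): restart.
            state = rng.choice(sorted(chain))
            output.extend(state)
            continue
        nxt = rng.choice(followers)
        output.append(nxt)
        state = state[1:] + (nxt,)
    return output[:length]


# Tiny stand-in corpus; in practice this would be the tokens of the play.
sample = "but soft what light through yonder window breaks it is the east".split()
chain = build_chain(sample)
print(" ".join(generate_words(chain, length=10)))
```

The random starting state is what keeps the output from always beginning with the first words of the source, which is the behaviour I expected from generate().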