
I'm having trouble using NLTK to generate random sentences from a custom corpus.

Before I start, I'd like to mention that I'm using NLTK version 2.x, so the generate function still exists.

Here is my current code:

import nltk

f = open('corpus/romeo and juliet.txt', 'r')  # renamed from 'file' to avoid shadowing the builtin
raw = f.read()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
text.generate(length=10)  # in NLTK 2, generate() prints its output itself and returns None

This runs, but does not create random sentences (I'm going for a horse_ebooks vibe). Instead, it gives me the first 10 words of my corpus source every time.

However, if I use NLTK's brown corpus, I get the desired random effect.

text = nltk.Text(nltk.corpus.brown.words())
text.generate(length=10)  # prints 10 randomly generated words

Looking into the Brown corpus files, it seems as though every word is separated and tagged with its part of speech (verb, adjective, etc.) - something I thought the word_tokenize call in my first block of code would take care of.

Is there a way to prepare a corpus like the Brown example - even if it means converting my .txt documents into that format instead of reading them directly?
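For context, here is roughly what I tried for exposing my own .txt files the way NLTK exposes Brown, using PlaintextCorpusReader (a rough sketch - the temporary directory and file name below just stand in for my real corpus/ folder):

```python
# Sketch: wrap plain .txt files as an NLTK corpus via PlaintextCorpusReader.
# The temp directory is a stand-in for my corpus/ folder; no POS tags are
# added - this only gives Brown-style .words() access to the raw text.
import os
import tempfile
from nltk.corpus.reader import PlaintextCorpusReader

root = tempfile.mkdtemp()
with open(os.path.join(root, 'romeo_and_juliet.txt'), 'w') as f:
    f.write("But soft, what light through yonder window breaks?")

# Second argument is a regex matching the file ids to include.
corpus = PlaintextCorpusReader(root, r'.*\.txt')
print(corpus.fileids())
print(corpus.words('romeo_and_juliet.txt'))
```

This gives Brown-like access (corpus.words(), corpus.sents()) but not the part-of-speech tags, which Brown was annotated with by hand.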

Any help would be appreciated - any documents on this are either horribly outdated or just say to use Markov chains (which I have tried, but I want to figure this out!). I understand generate() was removed as of NLTK 3.0 because of bugs.
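For reference, the Markov-chain approach I already tried looks roughly like this - a simplified, pure-Python trigram sketch with no NLTK, where the helper names are my own:

```python
# Simplified trigram Markov chain: map each pair of consecutive words to
# the words that can follow it, then walk the chain from a random start.
import random
from collections import defaultdict

def build_trigrams(tokens):
    """Map each consecutive word pair to the list of words that follow it."""
    model = defaultdict(list)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        model[(a, b)].append(c)
    return model

def generate(tokens, length=10, seed=None):
    rng = random.Random(seed)
    model = build_trigrams(tokens)
    # Starting from a random bigram (not always the first one) is what
    # makes the output vary between runs.
    a, b = rng.choice(list(model.keys()))
    out = [a, b]
    while len(out) < length:
        followers = model.get((a, b))
        if not followers:  # dead end: this bigram never continues
            break
        c = rng.choice(followers)
        out.append(c)
        a, b = b, c
    return ' '.join(out)

tokens = "the cat sat on the mat and the cat ran".split()
print(generate(tokens, length=8, seed=1))
```

This gets the horse_ebooks effect, but I'd still like to understand why NLTK's own generate() doesn't.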

Alexander Lozada
  • It will take some time before the devs fix the generate function after they fix the model package; see issues #800 and #736, https://github.com/nltk/nltk/pull/800 and https://github.com/nltk/nltk/issues/736 – alvas Jan 18 '15 at 17:53
  • I know they removed it from 3, hence I'm using version 2. Are you saying that this code not working with regular text is due to bugs in the code, rather than the text not being formatted like the Brown corpus? – Alexander Lozada Jan 18 '15 at 18:03
  • Yes, I think the bug in the smoothing and ngram modelling is causing generate to not work at all. I might be wrong, but why don't you wait 2-3 months for the issue to be fixed? Or you could fix the bug and push it back up to the repo. It's open source, after all. – alvas Jan 19 '15 at 13:25

0 Answers