I'm trying to build a character-level language model with NLTK's KneserNeyInterpolated function. What I have is a frequency list of words in a pandas dataframe, with the only column being its frequency (the word itself is the index). Based on the average length of the words, I've determined that a 9-gram model would be appropriate.
from nltk.util import ngrams
from nltk.lm.models import KneserNeyInterpolated

lm = KneserNeyInterpolated(9)
for i in range(df.shape[0]):
    lm.fit([list(ngrams(df.index[i], n=9))])
lm.generate(num_words=9)
# ValueError: Can't choose from empty population
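Part of the problem may be that most words are shorter than 9 characters, so `ngrams` yields nothing for them and most of the `fit` calls add no data at all:

```python
from nltk.util import ngrams

# ngrams() yields nothing when the sequence is shorter than n,
# so a word like "cat" contributes no 9-grams at all.
print(list(ngrams("cat", n=9)))        # -> []
# A word of exactly 9 characters produces a single 9-gram.
print(list(ngrams("elephants", n=9)))  # -> [('e', 'l', 'e', 'p', 'h', 'a', 'n', 't', 's')]
```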
Attempt at debugging:
from nltk.lm.models import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

n = 9  # order of the ngram model
train_data, padded_sents = padded_everygram_pipeline(4, 'whatisgoingonhere')
model = KneserNeyInterpolated(n)
model.fit(train_data, padded_sents)
model.generate(num_words=10)
# ['r', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']
This works (I guess?), but I can't figure out how to extend it to training the model on new words one at a time, and the generated output still doesn't look like realistic words. I feel like I'm missing something basic about how this module is supposed to work, and it doesn't help that every tutorial I can find is based on word-level ngrams.
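My best guess at the intended usage is something like the following: a minimal sketch, assuming each dictionary word should be treated as its own character-level "sentence" and passed to the pipeline in one batch (the `words` list here is a stand-in for my `df.index`, and I've used a lower order to keep the toy example small):

```python
from nltk.lm.models import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

n = 3  # toy order for this sketch; my real data would use 9
words = ["banana", "bandana", "cabana"]  # stand-in for df.index

# Treat every word as a character-level "sentence": the pipeline pads
# each one with <s>/</s> and builds all 1..n grams in a single pass,
# so fit() is called once with the whole corpus rather than per word.
train_data, padded_sents = padded_everygram_pipeline(n, [list(w) for w in words])
model = KneserNeyInterpolated(n)
model.fit(train_data, padded_sents)

print(model.generate(num_words=10, random_seed=3))
```

If this is right, then the per-word `fit` loop in my first attempt is the mistake, but I'd still like to know whether incremental training is possible at all.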