I have a list of lists where each internal list is a sentence that is tokenized into words:
sentences = [['farmer', 'plants', 'grain'],
             ['fisher', 'catches', 'tuna'],
             ['police', 'officer', 'fights', 'crime']]
Currently I am attempting to compute the nGrams like so:
from nltk.util import ngrams

numSentences = len(sentences)
nGrams = []
for i in range(numSentences):
    nGrams.append(list(ngrams(sentences, 2)))
This finds bigrams over the whole list of sentences rather than over the words within each sentence (and the result repeats once per sentence, which is predictable given the loop):
[[(['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']),
  (['fisher', 'catches', 'tuna'], ['police', 'officer', 'fights', 'crime'])],
 [(['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']),
  (['fisher', 'catches', 'tuna'], ['police', 'officer', 'fights', 'crime'])],
 [(['farmer', 'plants', 'grain'], ['fisher', 'catches', 'tuna']),
  (['fisher', 'catches', 'tuna'], ['police', 'officer', 'fights', 'crime'])]]
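The behaviour above can be reproduced without nltk installed, using a plain `zip`-based stand-in for `ngrams` (an assumption here: for n=2, `nltk.util.ngrams(seq, 2)` yields the same pairs as `zip(seq, seq[1:])`):

```python
# Stand-in for nltk.util.ngrams with n=2: pairs of consecutive items.
def bigrams(seq):
    return list(zip(seq, seq[1:]))

sentences = [['farmer', 'plants', 'grain'],
             ['fisher', 'catches', 'tuna'],
             ['police', 'officer', 'fights', 'crime']]

numSentences = len(sentences)
nGrams = []
for i in range(numSentences):
    # passing the whole list, not sentences[i], as in the code above
    nGrams.append(bigrams(sentences))

# nGrams now holds the same pair-of-sentences result three times,
# because each bigram pairs adjacent *sentences*, not adjacent words.
```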
How do I compute the n-grams of each sentence (word by word)? In other words, how do I ensure the n-grams don't span multiple list items? Here is my desired output:
farmer plants
plants grain
fisher catches
catches tuna
police officer
officer fights
fights crime