
I want to generate character n-grams of sizes 2 to 4. This is what I have so far:

from nltk import ngrams
sentence = ['i have an apple', 'i like apples so much']

for i in range(len(sentence)):
    for n in range(2, 5):
        n_grams = ngrams(sentence[i].split(), n)
        for grams in n_grams:
            print(grams)

This will give me:

('i', 'have')
('have', 'an')
('an', 'apple')
('i', 'have', 'an')
('have', 'an', 'apple')
('i', 'have', 'an', 'apple')
('i', 'like')
('like', 'apples')
('apples', 'so')
('so', 'much')
('i', 'like', 'apples')
('like', 'apples', 'so')
('apples', 'so', 'much')
('i', 'like', 'apples', 'so')
('like', 'apples', 'so', 'much')

How can I do this in an optimal way? I have very large input data, and my solution uses nested for loops, so the complexity is high and the algorithm takes a long time to finish.

Mr. Wizard

2 Answers


(Assuming you meant word n-grams rather than character n-grams.) I'm not sure whether your data can contain duplicate sentences, but you can try a set of the input sentences, and maybe a list comprehension:

%%timeit
from nltk import ngrams
sentence = ['i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much', 'so much']
n_grams = []
for i in range(len(sentence)):
    for n in range(2, 4):
        for item in ngrams(sentence[i].split(), n):
            n_grams.append(item)

Result:

1000 loops, best of 3: 228 µs per loop

Just using a list comprehension gave some improvement:

%%timeit
from nltk import ngrams
sentence = ['i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much', 'so much']
n_grams = [item for sent in sentence for n in range(2, 4) for item in ngrams(sent.split(), n)]

Result:

1000 loops, best of 3: 214 µs per loop

Another way is to use a set together with the list comprehension:

%%timeit
from nltk import ngrams
sentences = ['i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much','i have an apple', 'i like apples so much', 'i like apples so much', 'i like apples so much',
           'i like apples so much', 'i like apples so much', 'i like apples so much', 'so much']
# deduplicate the input sentences with a set
sentence = set(sentences)
n_grams = [item for sent in sentence for n in range(2, 4) for item in ngrams(sent.split(), n)]

Result:

10000 loops, best of 3: 23.5 µs per loop

So, if there are a lot of duplicate sentences, this might help. Note, though, that set drops the duplicates, so the resulting n-gram list no longer reflects how often each sentence occurred.
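
If you need counts that do reflect the duplicates, one variant (a sketch of my own, not part of the timings above) is to process each unique sentence once and weight its n-grams by how often the sentence occurs:

from collections import Counter
from nltk import ngrams

sentences = ['i have an apple', 'i like apples so much', 'i like apples so much']

sent_freq = Counter(sentences)        # unique sentence -> number of occurrences
gram_counts = Counter()
for sent, freq in sent_freq.items():  # each unique sentence is split only once
    for n in range(2, 4):
        for gram in ngrams(sent.split(), n):
            gram_counts[gram] += freq # weight the n-gram by sentence frequency

print(gram_counts[('i', 'like')])     # 2

This keeps the speed benefit of deduplication while preserving the frequencies.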

niraj
>>> from nltk import everygrams
>>> from collections import Counter

>>> sents = ['i have an apple', 'i like apples so much']

# For character ngrams, use the string directly as 
# the input to `ngrams` or `everygrams`

# If you want to keep the keys as tuples of characters.
>>> Counter(everygrams(sents[0], 1, 4))
Counter({('a',): 3, (' ',): 3, ('e',): 2, ('p',): 2, (' ', 'a'): 2, ('n',): 1, ('v', 'e'): 1, (' ', 'a', 'n'): 1, ('v', 'e', ' '): 1, (' ', 'h', 'a'): 1, ('l', 'e'): 1, ('n', ' '): 1, ('p', 'p', 'l', 'e'): 1, ('e', ' ', 'a'): 1, ('a', 'v', 'e'): 1, ('p', 'l'): 1, ('a', 'v', 'e', ' '): 1, ('a', 'v'): 1, (' ', 'a', 'p'): 1, (' ', 'a', 'p', 'p'): 1, ('h', 'a'): 1, ('i', ' ', 'h', 'a'): 1, ('i',): 1, ('i', ' ', 'h'): 1, ('v', 'e', ' ', 'a'): 1, ('p', 'p', 'l'): 1, ('e', ' '): 1, ('p', 'p'): 1, (' ', 'a', 'n', ' '): 1, ('n', ' ', 'a', 'p'): 1, (' ', 'h', 'a', 'v'): 1, ('a', 'p', 'p', 'l'): 1, ('a', 'n', ' '): 1, (' ', 'h'): 1, ('n', ' ', 'a'): 1, ('a', 'n', ' ', 'a'): 1, ('a', 'p', 'p'): 1, ('h', 'a', 'v'): 1, ('a', 'n'): 1, ('v',): 1, ('h', 'a', 'v', 'e'): 1, ('h',): 1, ('a', 'p'): 1, ('i', ' '): 1, ('p', 'l', 'e'): 1, ('l',): 1, ('e', ' ', 'a', 'n'): 1})

# If you want the keys to be plain strings.
>>> Counter(map(''.join,everygrams(sents[0], 1, 4)))
Counter({' ': 3, 'a': 3, ' a': 2, 'e': 2, 'p': 2, 'ppl': 1, 've': 1, ' h': 1, 'i ha': 1, 'an': 1, 'ap': 1, 'have': 1, 'av': 1, 'ave': 1, 'pp': 1, 'le': 1, 'n ap': 1, ' app': 1, ' an': 1, ' ap': 1, 'appl': 1, 'i h': 1, 'app': 1, 'pl': 1, 'an ': 1, 'pple': 1, 'e ': 1, 'e a': 1, 'ple': 1, 'e an': 1, 'i ': 1, 'ha': 1, 'n a': 1, 've a': 1, ' an ': 1, 'i': 1, 'h': 1, 'ave ': 1, 'l': 1, 'n': 1, 'an a': 1, ' hav': 1, 'n ': 1, 've ': 1, 'v': 1, ' ha': 1, 'hav': 1})


# If you want word ngrams:

>>> Counter(map(' '.join,everygrams(sents[0].split(), 1, 4)))
Counter({'have an': 1, 'apple': 1, 'i': 1, 'i have an': 1, 'i have an apple': 1, 'an': 1, 'have': 1, 'have an apple': 1, 'i have': 1, 'an apple': 1})

# Or using word_tokenize
>>> from nltk import word_tokenize
>>> Counter(map(' '.join,everygrams(word_tokenize(sents[0]), 1, 4)))
Counter({'have an': 1, 'apple': 1, 'i': 1, 'i have an': 1, 'i have an apple': 1, 'an': 1, 'have': 1, 'have an apple': 1, 'i have': 1, 'an apple': 1})

If speed is a concern, see Fast n-gram calculation.
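
If it is really character n-grams you are after, a common trick in that direction (sketched here with a helper name, char_ngrams, of my own choosing) is plain string slicing, which avoids building a tuple per n-gram:

def char_ngrams(text, n_min=2, n_max=4):
    # yield every character n-gram of text for n_min <= n <= n_max
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

sents = ['i have an apple', 'i like apples so much']
grams = [g for s in sents for g in char_ngrams(s)]
print(grams[:5])  # ['i ', ' h', 'ha', 'av', 've']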

A complexity of O(MN) is natural here when you have M sentences and N n-gram orders to iterate through. Even everygrams iterates through the n-gram orders one by one.
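
As a rough illustration (this mirrors the behaviour, not NLTK's actual source), everygrams amounts to one ngrams pass per order:

from nltk import ngrams

def my_everygrams(seq, min_len, max_len):
    # seq must be re-iterable (a list or a string), because we make
    # one full pass over it per n-gram order -- hence O(MN) overall
    for n in range(min_len, max_len + 1):
        for gram in ngrams(seq, n):
            yield gram

list(my_everygrams('abc', 1, 2))
# [('a',), ('b',), ('c',), ('a', 'b'), ('b', 'c')]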

I'm sure there are more efficient ways to compute n-grams, but I suspect you will run into memory problems before speed problems once you compute n-grams at a large scale. In that case, may I suggest https://github.com/kpu/kenlm
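
For completeness, a minimal sketch of querying a KenLM model from its Python bindings (the model path is hypothetical; the .arpa file has to be built offline first, e.g. with KenLM's lmplz tool):

import kenlm  # Python bindings for https://github.com/kpu/kenlm

# hypothetical model file, built beforehand with something like:
#   lmplz -o 4 < corpus.txt > your_model.arpa
model = kenlm.Model('your_model.arpa')

# log10 probability of the sentence under the n-gram model
print(model.score('i like apples so much', bos=True, eos=True))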

alvas