Tokenize with ngram range

Question

Is there any way to tokenize strings with an ngram range, like the features you get from a CountVectorizer? For example, with ngram range = (1, 2):

strings = ['this is the first sentence','this is the second sentence']

to

[['this','this is','is','is the','the','the first','first','first sentence','sentence'],['this','this is','is','is the','the','the second','second','second sentence','sentence']]

Update: iterating over n, I get:

from nltk import ngrams  # assumption: ngrams here is NLTK's helper

sentence = 'this is the first sentence'

nrange_array = []
for n in range(1, 3):
    nrange = ngrams(sentence.split(), n)
    nrange_array.append(nrange)

for nrange in nrange_array:
    for grams in nrange:
        print(grams)

output:

('this',)
('is',)
('the',)
('first',)
('sentence',)
('this', 'is')
('is', 'the')
('the', 'first')
('first', 'sentence')

and I want:

('this','this is','is','is the','the','the first','first','first sentence','sentence')

I have the data tokenized into 1-grams (single words) and 2-grams (bigrams) at the word level. I tried to append them and got an array with the single words stacked under the bigrams. I also tried .concat with pandas and got the same thing along a different axis. Now I'm trying a for loop, but I think there may be a better way. – Seba Chamena, Oct 07 '18 at 14:49
Possible duplicate of [n-grams in python, four, five, six grams?](https://stackoverflow.com/questions/17531684/n-grams-in-python-four-five-six-grams) – serges_newj, Oct 07 '18 at 14:56
No. That question is about getting n-gram tokens for a single n. I want tokens over a range of n, which isn't the same. – Seba Chamena, Oct 07 '18 at 14:59
@SebaChamena: there is not much difference, just iterate over the `n`... – Willem Van Onsem, Oct 07 '18 at 15:47
@WillemVanOnsem I just updated the question to explain why that didn't work for me. – Seba Chamena, Oct 07 '18 at 16:41

score 0 · Accepted Answer · answered Oct 07 '18 at 17:04

I hope this code helps.

x = "this is the first sentence"
words = x.split()
result = []

for index, word in enumerate(words):
    # Add the single word (1-gram).
    result.append(word)

    # Add the bigram starting at this word, unless it is the last word.
    if index != len(words) - 1:
        result.append(" ".join([word, words[index + 1]]))

print(result)  # Output: ["this", "this is", ...]
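
The loop above is fixed to the range (1, 2). If a wider range is needed, the same idea could be generalized roughly as follows. This is only a sketch: the function name ngram_range_tokens and its min_n/max_n parameters are made up for illustration, and it assumes plain whitespace splitting rather than CountVectorizer's own tokenization.

```python
def ngram_range_tokens(sentence, min_n=1, max_n=2):
    """Return word n-grams for n in min_n..max_n, ordered by start position.

    Hypothetical helper: tokens are produced by str.split(), so punctuation
    and casing are left untouched.
    """
    words = sentence.split()
    tokens = []
    for start in range(len(words)):
        for n in range(min_n, max_n + 1):
            end = start + n
            if end > len(words):
                break  # any longer n-gram would also run past the end
            tokens.append(" ".join(words[start:end]))
    return tokens


strings = ["this is the first sentence", "this is the second sentence"]
result = [ngram_range_tokens(s, 1, 2) for s in strings]
print(result[0])
```

Because the outer loop walks over start positions and the inner loop over n, each word is followed directly by the longer n-grams that begin with it, which is the interleaved order asked for in the question.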

hearot

Excellent! Thank you. – Seba Chamena Oct 07 '18 at 17:18
@SebaChamena No problem! – hearot Oct 07 '18 at 17:19
