
I need to get the most popular n-grams from a text. The n-grams must be from 1 to 5 words long.

I know how to get bigrams and trigrams. For example:

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)
finder.apply_word_filter(filter_stops)
matches1 = finder.nbest(bigram_measures.pmi, 20)

However, I found out that scikit-learn can extract n-grams of various lengths. For example, I can get n-grams of length 1 to 5:

v = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=5))

But WordNGramAnalyzer is now deprecated. My question is: how can I get the N best word collocations from my text, with collocation lengths from 1 to 5? I also need a frequency list (FreqList) of these collocations/n-grams.

Can I do that with NLTK or scikit-learn? I need to get combinations of n-grams of various lengths from one text.

For example, when using NLTK bigrams and trigrams, there are many situations in which my trigrams include my bigrams, or my trigrams are part of bigger 4-grams. For example:

bigrams: hello my
trigrams: hello my name

I know how to exclude bigrams from trigrams, but I need a better solution.

artyomboyko

3 Answers


update

Since scikit-learn 0.14 the format has changed to:

n_grams = CountVectorizer(ngram_range=(1, 5))

Full example:

test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."

from sklearn.feature_extraction.text import CountVectorizer

c_vec = CountVectorizer(ngram_range=(1, 5))

# input to fit_transform() should be an iterable with strings
ngrams = c_vec.fit_transform([test_str1, test_str2])

# needs to happen after fit_transform()
vocab = c_vec.vocabulary_

count_values = ngrams.toarray().sum(axis=0)

# output n-grams
for ng_count, ng_text in sorted([(count_values[i],k) for k,i in vocab.items()], reverse=True):
    print(ng_count, ng_text)

Under Python 2 this outputs the tuples below (Python 3 prints the same pairs without the tuple syntax). The word I is missing not because it is a stop word (it is not) but because of its length (see https://stackoverflow.com/a/20743758/):

> (3, u'to')
> (3, u'from')
> (2, u'ngrams')
> (2, u'need')
> (1, u'words')
> (1, u'trigrams but need better solutions')
> (1, u'trigrams but need better')
...
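The dropped one-letter token comes from CountVectorizer's default token_pattern, r"(?u)\b\w\w+\b", which only matches tokens of two or more word characters. A minimal sketch, assuming you want to keep words like "I": pass a pattern that accepts single characters and build a term-to-count mapping from the result. The shortened test strings and the names docs, totals and freq are only illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I need to get most popular ngrams from text.",
    "I know how to exclude bigrams from trigrams.",
]

# Assumption: we want one-character tokens too, so we override the
# default pattern (which requires at least two word characters).
vectorizer = CountVectorizer(ngram_range=(1, 5), token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(docs)

# Sum every column of the sparse document-term matrix to get
# corpus-wide totals, then map each n-gram to its total.
totals = np.asarray(counts.sum(axis=0)).ravel()
freq = {term: int(totals[idx]) for term, idx in vectorizer.vocabulary_.items()}

# Show the ten most frequent n-grams.
for term, n in sorted(freq.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(n, term)
```

Note that CountVectorizer lowercases by default, so the token appears as "i" in the vocabulary.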

IMO, this could be much simpler these days. You can try libraries such as textacy, but they come with their own complications, such as initializing a Doc, which currently does not work in v0.6.2 the way their docs show. If Doc initialization worked as documented, the following would work in theory (but it doesn't):

test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."

import textacy

# some version of the following line
doc = textacy.Doc([test_str1, test_str2])

ngrams = doc.to_bag_of_terms(ngrams={1, 5}, as_strings=True)
print(ngrams)

old answer

WordNGramAnalyzer has indeed been deprecated since scikit-learn 0.11. Creating n-grams and getting term frequencies are now combined in sklearn.feature_extraction.text.CountVectorizer. You can create all n-grams of length 1 to 5 as follows:

n_grams = CountVectorizer(min_n=1, max_n=5)

More examples and information can be found in scikit-learn's documentation about text feature extraction.

arturomp
Sicco
  • If you don't want TF-IDF normalization just use: `CountVectorizer(min_n=1, max_n=5).fit_transform(list_of_strings)`. – ogrisel Aug 01 '12 at 21:23
  • But what do I do next? How do I get the n-gram frequencies? – artyomboyko Aug 02 '12 at 06:05
  • @twoface88: `v = CountVectorizer(min_n=1, max_n=5); X = v.fit_transform(["an apple a day keeps the doctor away"]); zip(v.inverse_transform(X)[0], X.A[0])`. Note that stopwords and one-char tokens will be removed by default. – Fred Foo Aug 02 '12 at 08:53
  • For `CountVectorizer`: "DeprecationWarning: Parameters max_n and min_n are deprecated. use ngram_range instead. This will be removed in 0.14". So, `CountVectorizer(ngram_range=(1, 5))` – demongolem Jan 18 '13 at 18:51
  • In the latest version sklearn changed the format to `n_grams = CountVectorizer(ngram_range=(1, 5))` – Lior Magen Apr 07 '16 at 08:54

If you want to generate the raw ngrams (and count them yourself, perhaps), there's also nltk.util.ngrams(sequence, n). It will generate a sequence of ngrams for any value of n. It has options for padding, see the documentation.
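For the "count them yourself" route, here is a minimal standard-library sketch. It is a hand-rolled stand-in, not NLTK's implementation, and the function name count_ngrams and the sample sentence are made up for illustration. It slides a window of each length from 1 to 5 over a token list and tallies the results in a Counter.

```python
from collections import Counter

def count_ngrams(tokens, min_n=1, max_n=5):
    """Count every n-gram of length min_n..max_n in a token list."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # A window of size n fits len(tokens) - n + 1 times.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

tokens = "hello my name is hello my friend".split()
counts = count_ngrams(tokens)

# most_common() gives the frequency list, highest counts first.
for ngram, n in counts.most_common(5):
    print(n, " ".join(ngram))
```

Swapping the inner loop for nltk.util.ngrams(tokens, n) should give the same tuples if you prefer to lean on NLTK.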

alexis

Looking at http://nltk.org/_modules/nltk/util.html, I think that under the hood nltk.util.bigrams() and nltk.util.trigrams() are implemented using nltk.util.ngrams().

AlgebraWinter