I need to get the most popular n-grams from a text. The n-grams must be from 1 to 5 words long.
I know how to get bigrams and trigrams. For example:
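import nltk

# `words` is the tokenized text; `filter_stops` is a predicate returning True for words to drop
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)  # ignore bigrams seen fewer than 3 times
finder.apply_word_filter(filter_stops)
matches1 = finder.nbest(bigram_measures.pmi, 20)  # top 20 bigrams by PMI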
However, I found out that scikit-learn can extract n-grams of varying length. For example, I can get n-grams of length 1 to 5:
v = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=5))
But WordNGramAnalyzer is now deprecated. My question is: how can I get the N best word collocations from my text, with collocation lengths from 1 to 5? I also need a frequency list (FreqList) of these collocations/n-grams.
Can I do that with NLTK or scikit-learn? I need to get n-grams of various lengths from a single text.
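To make the desired output concrete, here is a minimal sketch in plain Python (standard library only, not an NLTK or scikit-learn API) of the kind of frequency list I am after. The function name `ngram_counts` and the sample text are made up for illustration, and tokenization is just whitespace splitting.

```python
from collections import Counter

def ngram_counts(words, min_n=1, max_n=5):
    """Count every contiguous n-gram with min_n <= n <= max_n (hypothetical helper)."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of size n across the token list.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Made-up sample text, split on whitespace
words = "hello my name is bob hello my name is alice".split()
counts = ngram_counts(words)

# The N most frequent n-grams of any length from 1 to 5, with their counts
top = counts.most_common(20)
```

Ranking by raw count like this favors short n-grams, so ideally I would like an association score such as the PMI used for the bigrams above, but applied across all lengths.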
For example, when I use NLTK bigrams and trigrams separately, there are many cases where my trigrams include my bigrams, or where my trigrams are part of larger 4-grams. For example:
bigrams: hello my
trigrams: hello my name
I know how to exclude bigrams from trigrams, but I need a better solution.
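One way such an exclusion could look, as a rough sketch in plain Python: drop an n-gram whenever a longer n-gram that contains it occurs exactly as often. Both the rule and the name `drop_subsumed` are assumptions for illustration, and the counts are invented.

```python
from collections import Counter

def drop_subsumed(counts):
    """Drop an n-gram when an (n+1)-gram containing it has the same count (illustrative rule)."""
    subsumed = set()
    for gram, c in counts.items():
        if len(gram) < 2:
            continue
        # The two (n-1)-grams contained in this n-gram: its prefix and its suffix.
        for sub in (gram[:-1], gram[1:]):
            if counts.get(sub) == c:
                subsumed.add(sub)
    return {g: c for g, c in counts.items() if g not in subsumed}

# Invented counts, keyed by tuples of words
counts = Counter({
    ("hello", "my"): 3,
    ("hello", "my", "name"): 3,
    ("my", "name"): 5,
})
kept = drop_subsumed(counts)
```

Here `("hello", "my")` is dropped because it never occurs outside `("hello", "my", "name")`, while `("my", "name")` is kept because it also occurs elsewhere. This only handles exact count matches between adjacent lengths, which is part of why I am looking for something better.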