
I'm trying to use Python and NLTK to do text classification on text strings that tend to be only 10-20 words long on average.

I want to compute word frequencies and n-grams of size 2-4, somehow convert those to vectors, and use that to build SVM models.

I'm thinking that there might be a very standard NLTK way to do all those things but I'm having trouble finding it.

I'm thinking that the standard way might already be smart about things like stemming the words (so "Important" and "Importance" would be treated as the same word), dropping punctuation and very common English words, and might implement a clever way to turn these counts into vectors for me. I'm new to text classification and to Python, and I'm open to suggestions about all of this!

Mahfuz
  • Please check [how to ask a question](https://stackoverflow.com/help/how-to-ask) and [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve). Add your code and what you have tried. – Morse Apr 01 '18 at 00:20

1 Answer


OK, my first ever attempt at answering a Stack Overflow question...

Your question is a bit vague, so I'll answer it as best I understand it. It sounds like you're asking how to prepare text prior to building SVM models: specifically, how to lemmatize text input, compute word frequencies, and create n-grams from a given string.

import nltk
from collections import Counter
from nltk import ngrams
from nltk.stem import WordNetLemmatizer

# note: this requires the 'punkt' and 'wordnet' NLTK data packages
# (run nltk.download('punkt') and nltk.download('wordnet') once)


# lowercase, drop punctuation, and lemmatize a string
def word_generator(text):
    wnl = WordNetLemmatizer()
    tokens = nltk.word_tokenize(text)
    yield from (wnl.lemmatize(token.lower()) for token in tokens if token.isalpha())


# create a list of (word, count) pairs, most frequent first
def freq_count(text):
    return Counter(word_generator(text)).most_common()


# create n-grams
def make_ngrams(text, n):
    return list(ngrams(word_generator(text), n))

Example frequency and 4-gram output:

>>> my_str = 'This is this string, not A great Strings not the greatest string'

>>> print(freq_count(my_str))
[('string', 3), ('this', 2), ('not', 2), ('is', 1), ('a', 1), ('great', 1), ('the', 1), ('greatest', 1)]

>>> print(make_ngrams(my_str, 4))
[('this', 'is', 'this', 'string'), ('is', 'this', 'string', 'not'), ('this', 'string', 'not', 'a'), ('string', 'not', 'a', 'great'), ('not', 'a', 'great', 'string'), ('a', 'great', 'string', 'not'), ('great', 'string', 'not', 'the'), ('string', 'not', 'the', 'greatest'), ('not', 'the', 'greatest', 'string')]

Then you can do whatever you want with this, such as creating vectors.
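For example, one simple way to turn those frequency lists into fixed-length vectors, without any extra libraries, is to build a shared vocabulary across all your strings and index into it. A minimal sketch, assuming the `(word, count)` pairs come from a function like `freq_count` above (the `to_vectors` name is just illustrative):

```python
# build fixed-length count vectors from lists of (word, count) pairs,
# e.g. the output of freq_count for several strings
def to_vectors(freq_lists):
    # shared vocabulary across all documents, in a stable (sorted) order
    vocab = sorted({word for freqs in freq_lists for word, _ in freqs})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for freqs in freq_lists:
        vec = [0] * len(vocab)  # one slot per vocabulary word
        for word, count in freqs:
            vec[index[word]] = count
        vectors.append(vec)
    return vocab, vectors


docs = [[('string', 3), ('this', 2)], [('great', 1), ('string', 1)]]
vocab, vectors = to_vectors(docs)
print(vocab)    # ['great', 'string', 'this']
print(vectors)  # [[0, 3, 2], [1, 1, 0]]
```

Each row then lines up with the same vocabulary, so the rows can be fed straight into an SVM as feature vectors.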

dingo_dog
  • Love the WordNetLemmatizer idea! Is [DictVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html#sklearn.feature_extraction.DictVectorizer) a useful thing? How and when would I use it? I find it confusing. – JSLover Apr 01 '18 at 19:08
  • Would you use the CountVectorizer? If so, how does THAT work? – JSLover Apr 01 '18 at 19:24
  • @JSLover Thanks! I haven't used DictVectorizer, but it looks interesting. My understanding, from the docs and [this post](https://stackoverflow.com/questions/27473957/understanding-dictvectorizer-in-scikit-learn) combined, is that DictVectorizer creates vectors from dictionaries; in other words, it loads features from a dictionary. [sklearn's feature extraction user guide](http://scikit-learn.org/stable/modules/feature_extraction.html#dict-feature-extraction) also provides a helpful example. – dingo_dog Apr 01 '18 at 20:33
  • @JSLover as for CountVectorizer, [this post](https://stackoverflow.com/questions/22920801/can-i-use-countvectorizer-in-scikit-learn-to-count-frequency-of-documents-that-w) explains it pretty well. As I understand it, you provide vector keys and CountVectorizer transforms a document (or multiple documents) into feature arrays. Pretty cool. And now that I'm aware of them, I'm sure I'll find myself using these classes at some point. – dingo_dog Apr 01 '18 at 20:42
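To make the two classes from the comments above concrete, here is a rough sketch of how each one could be used on short strings like the ones in the question. This is only an illustration of the general pattern, not the original author's method; note that CountVectorizer's default tokenizer drops single-character tokens like "a", and its `ngram_range` parameter can produce the 1- to 4-grams the question asks about:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer: raw strings -> matrix of token/n-gram counts
cv = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X = cv.fit_transform(['not a great string', 'a great string not'])
# 'a' is dropped by the default tokenizer (tokens need 2+ characters)
print(sorted(cv.vocabulary_))
print(X.toarray())

# DictVectorizer: {feature: value} dicts -> feature matrix,
# e.g. fed from precomputed word-frequency dictionaries
dv = DictVectorizer(sparse=False)
Y = dv.fit_transform([{'string': 3, 'this': 2}, {'great': 1}])
print(Y)
```

Either matrix can then be passed to an SVM classifier such as `sklearn.svm.SVC`.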