I am following a tutorial about building machine learning systems in Python, modifying it as I go, and trying to classify a new post as belonging to one of 7 different categories.
import nltk.stem
from sklearn.feature_extraction.text import TfidfVectorizer

english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # build the standard analyzer, then stem each token it yields
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))
My vectorizer looks like the one below. Among other things, I am trying to test the sensitivity to n-grams of up to size 4, but I am not sure whether that is an optimal parameter.
vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore', ngram_range=(1, 4))
My 'new post' to classify gets transformed into a vector, which is then compared to the vectors representing each of the categories. Although the classifier does a good job for some tags, for other tags the category that best describes the post gets the 2nd highest score, not the first.
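To illustrate the transform-and-compare step described above, here is a minimal sketch. For brevity it uses the plain TfidfVectorizer rather than the stemmed subclass (the subclass is used the same way), and the category texts and new post are made-up sample data:

```python
# Sketch of vectorizing a new post and comparing it to category vectors.
# Uses the plain TfidfVectorizer for brevity; category_docs and new_post
# are hypothetical sample data.
from sklearn.feature_extraction.text import TfidfVectorizer
import scipy.linalg

category_docs = [
    "the election and the parliament vote",
    "the football match and the league score",
]
new_post = "the parliament held a vote"

tfidf = TfidfVectorizer(min_df=1, stop_words='english')
category_vecs = tfidf.fit_transform(category_docs)  # one row per category
new_post_vec = tfidf.transform([new_post])          # same feature space

# normalized Euclidean distance from the new post to each category row
for i in range(category_vecs.shape[0]):
    v1 = new_post_vec.toarray().ravel()
    v2 = category_vecs[i].toarray().ravel()
    d = scipy.linalg.norm(v1 / scipy.linalg.norm(v1) - v2 / scipy.linalg.norm(v2))
    print(i, d)
```

The category with the smallest distance is taken as the predicted tag.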
I suspect that my problem is the distance metric that I am using to compare vectors, which is a simple Euclidean distance.
import scipy as sp
import scipy.linalg

def dist_norm(v1, v2):
    # normalize each sparse vector to unit length, then return the
    # Euclidean norm of their difference
    v1_normalized = v1 / sp.linalg.norm(v1.toarray())
    v2_normalized = v2 / sp.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return sp.linalg.norm(delta.toarray())
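To make the question about other metrics concrete, here is a sketch of a cosine-distance variant that could be swapped in for dist_norm (dist_cosine is my own hypothetical name; it assumes the same scipy sparse row vectors that dist_norm receives):

```python
import scipy as sp
import scipy.linalg
import scipy.sparse

def dist_cosine(v1, v2):
    # cosine distance = 1 - cosine similarity; ranges from 0 (parallel)
    # to 1 (orthogonal) for non-negative tf-idf vectors
    num = v1.multiply(v2).sum()
    den = sp.linalg.norm(v1.toarray()) * sp.linalg.norm(v2.toarray())
    return 1.0 - num / den

# quick sanity check on two orthogonal sparse vectors
a = scipy.sparse.csr_matrix([[1.0, 0.0]])
b = scipy.sparse.csr_matrix([[0.0, 1.0]])
print(dist_cosine(a, b))  # 1.0
```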
My questions are:

1) Are there other distance metrics that can be used?

2) How can I modify dist_norm to accommodate other distance metrics?

3) For the ML experts out there, is my problem a feature-engineering problem or a distance-metric problem? I currently have 7 large samples with over 1MM features (using n-grams of size 4 might be overkill).

4) Are there any IPython notebooks or classic tutorials to follow for classifying text into several categories? (For example, a topic that can be classified as both "politics" and "people", or some "fuzzy metric" to choose 2 tags instead of one.)
Thanks