
I am following a tutorial about building machine learning systems in Python, and I am modifying it as I go, trying to classify a new post as belonging to one of 7 different categories.

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Wrap the standard analyzer so every token gets stemmed
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

My vectorizer looks like the one below. Among other things, I am trying to test the sensitivity to n-grams of size 4, but I am not sure whether that is an optimal parameter.

vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore', ngram_range=(1, 4))

My 'new post' to classify gets transformed into a vector, which is then compared against the vectors that represent each of the categories. Although the classifier is doing a good job for some tags, for other tags the category that best describes the post only comes up with the 2nd-highest score, not the first.

I suspect that my problem is the distance metric that I am using to compare vectors, which is a simple Euclidean distance.

import scipy as sp
import scipy.linalg

def dist_norm(v1, v2):
    # Normalize both sparse vectors to unit length, then take the
    # Euclidean norm of their difference
    v1_normalized = v1 / sp.linalg.norm(v1.toarray())
    v2_normalized = v2 / sp.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return sp.linalg.norm(delta.toarray())

My questions are:

1) Are there other distance metrics that can be used?
2) How can I modify dist_norm to accommodate other distance metrics?
3) For the ML experts out there, is my problem a feature engineering problem or a distance metric problem? I currently have 7 large samples with over 1 million features (using n-grams of size 4 might be overkill).
4) Are there any IPython notebooks or classic tutorials to follow for text classification into several categories? (For example, a topic that can be classified as both "politics" and "people", or some "fuzzy metric" to choose 2 tags instead of one.)

Thanks

headdetective
scipy.spatial.distance.pdist is an excellent source of distance metrics: http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html. I am guessing that a Pearson (correlation) metric might give better results than Euclidean, but your model might not have the right features to begin with. – Luis Miguel Oct 13 '14 at 16:43
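A minimal sketch of what that comment suggests, using scipy.spatial.distance.cdist; the random arrays here are placeholders for the real dense TF-IDF rows, not anything from the question:

import numpy as np
from scipy.spatial.distance import cdist

post = np.random.rand(1, 20)        # placeholder for the new post's dense vector
categories = np.random.rand(7, 20)  # placeholder: one row per category

# 'correlation' is 1 - Pearson correlation, 'cosine' is 1 - cosine similarity
for metric in ('euclidean', 'cosine', 'correlation'):
    distances = cdist(post, categories, metric=metric)
    print(metric, distances.argmin())  # index of the closest category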

1 Answer


A very common and efficient metric that you can use instead of Euclidean distance is cosine similarity (http://en.wikipedia.org/wiki/Cosine_similarity).

You can read about an implementation of cosine similarity in Python (to replace def dist_norm(v1, v2)) here: Cosine Similarity between 2 Number Lists
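For completeness, a rough sketch of such a replacement (not the exact code from that link), assuming v1 and v2 are the 1 x n_features sparse rows produced by the vectorizer:

from scipy.linalg import norm

def dist_cosine(v1, v2):
    # Cosine distance = 1 - cosine similarity;
    # v1.multiply(v2).sum() is the dot product of the two sparse rows
    dot = v1.multiply(v2).sum()
    return 1.0 - dot / (norm(v1.toarray()) * norm(v2.toarray()))

scikit-learn also ships this directly as sklearn.metrics.pairwise.cosine_similarity, which accepts sparse input, so you do not have to densify the vectors at all.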

As far as I know, when dealing with a classification task we usually do not have such a thing as a "distance metric problem". There are several standard metrics that are commonly used; sometimes people apply more than one of them, or use a single metric with different parameters, and compare the results. But in an empirical classification task we rarely modify these metrics, unless you really want to do theoretical research on metrics themselves. I think you should look at your problem as a feature engineering task.

For many IR/NLP tasks, choosing n-grams of size 3 is usually advised, as it is large enough to capture some syntactic dependencies but not so large that it introduces too much irrelevant information.
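Applied to the vectorizer from the question, that would just mean capping ngram_range (a sketch, assuming the StemmedTfidfVectorizer class defined above):

# Same parameters as in the question, but capped at trigrams
vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english',
                                    decode_error='ignore', ngram_range=(1, 3))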

Document/text classification is a vast topic. If you want to know about classifying a collection of documents, you should learn about the following (see the sketch after this list):

1. Text pre-processing
2. (Textual) feature extraction
3. Similarity measures
4. Machine learning models
5. Evaluation of ML models and visualisation (optional)
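A minimal end-to-end sketch of those steps with scikit-learn; the toy corpus, labels, and model choice here are placeholders, not anything from the question or answer:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Placeholder corpus: in practice, use your 7-category training documents
docs = ["politics post ...", "sports post ...", "tech post ..."] * 10
labels = ["politics", "sports", "tech"] * 10

X_train, X_test, y_train, y_test = train_test_split(docs, labels, test_size=0.3)

# Steps 1-2 (pre-processing + feature extraction) live in the vectorizer;
# step 4 is the classifier; step 5 is the evaluation report.
model = make_pipeline(TfidfVectorizer(stop_words='english', ngram_range=(1, 3)),
                      LogisticRegression())
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

For question 4 in the post, a probabilistic classifier like this also gives a score per category via predict_proba, so you can keep the top two tags instead of just the single best one.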

You might already know this, but when you are dealing with text, it is also very useful to learn about Regular Expressions.
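For instance, a toy example (my own illustration, not from the answer) of regex-based cleanup before vectorizing:

import re

def clean(text):
    # Drop URLs, then collapse anything that is not a letter into spaces
    text = re.sub(r'https?://\S+', ' ', text)
    text = re.sub(r'[^a-zA-Z]+', ' ', text)
    return text.strip().lower()

print(clean("Check http://example.com, it's great!!"))  # -> "check it s great"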

MAZDAK