You didn't mention the type of wordset1 and wordset2. I'll assume they are both strings.
You defined your distance as the word count and got a bad score. It's clear that text length is not a good dissimilarity measure: two emails of different sizes can talk about the same thing, while two emails of the same size can talk about completely different things.
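For reference, here is a minimal sketch of that kind of length-based metric (a guess, since you didn't post your code), together with an example of why it breaks down:

def distance(wordset1, wordset2):
    # Hypothetical reconstruction: difference in word counts.
    return abs(len(wordset1.split()) - len(wordset2.split()))

# Two emails about completely different topics, yet distance 0:
print(distance('cheap pills online now', 'meeting moved to friday'))  # 0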
So, as suggested above, you could try and check for SIMILAR WORDS instead:
import numpy as np

def distance(wordset1, wordset2):
    wordset1 = set(wordset1.split())
    wordset2 = set(wordset2.split())

    common_words = wordset1 & wordset2
    if common_words:
        return 1 / len(common_words)
    else:
        # They don't share any word. They are infinitely different.
        return np.inf
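With two made-up one-line emails, for instance:

print(distance('free money now', 'get free money today'))  # 0.5 (shares 'free' and 'money')
print(distance('meeting at noon', 'lunch tomorrow'))        # inf (no words in common)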
The problem with that is that two big emails are more likely to share words than two small ones, so this metric favors them, making them look "more similar to each other" than the small ones. How do we solve this? Well, we normalize the metric somehow:
import numpy as np

def distance(wordset1, wordset2):
    wordset1 = set(wordset1.split())
    wordset2 = set(wordset2.split())

    common_words = wordset1 & wordset2
    if common_words:
        # The distance, normalized by the total number of different
        # words in the two emails (the inverse of the Jaccard index).
        return len(wordset1 | wordset2) / len(common_words)
    else:
        # They don't share any word. They are infinitely different.
        return np.inf
This seems cool, but it completely ignores the FREQUENCY of the words. To account for that, we can use the bag-of-words model: build a vocabulary of all the words that appear and count how often each one occurs in each document. Let's use the CountVectorizer implementation from scikit-learn to make our job easier:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def distance(wordset1, wordset2):
    model = CountVectorizer()
    X = model.fit_transform([wordset1, wordset2]).toarray()

    # Euclidean distance between the two bags of words.
    return np.linalg.norm(X[0] - X[1])
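To see what the vectorizer actually builds, here's a quick check on a hypothetical two-email corpus (get_feature_names_out needs scikit-learn >= 1.0; older versions use get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

model = CountVectorizer()
X = model.fit_transform(['free money now', 'free money money money']).toarray()
print(model.get_feature_names_out())  # ['free' 'money' 'now']
print(X)                              # [[1 1 1]
                                      #  [1 3 0]]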
But now consider two pairs of emails. The emails in the first pair are composed of perfectly written English, full of "small" words (e.g. a, an, is, and, that) needed for them to be grammatically correct. The emails in the second pair are different: containing only the keywords, they are extremely dry. You see, chances are the first pair will look more similar than the second one. That happens because we currently weight all words the same, while we should be prioritizing the MEANINGFUL words in each text. To do that, let's use term frequency–inverse document frequency (TF-IDF). Luckily, there's a very similar implementation in scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer

def distance(wordset1, wordset2):
    model = TfidfVectorizer()
    X = model.fit_transform([wordset1, wordset2]).toarray()

    # TfidfVectorizer L2-normalizes each row, so the dot product
    # gives the cosine similarity between the two emails.
    similarity_matrix = X.dot(X.T)

    # The dissimilarity between samples wordset1 and wordset2.
    return 1 - similarity_matrix[0, 1]
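A quick usage check with two hypothetical emails (note that fitting the vectorizer on only two documents makes the IDF weighting rather weak; in practice you would fit it over your whole email corpus):

# Overlapping words ('the', 'invoice', 'attached'): distance below 1.
print(distance('please find the invoice attached', 'the invoice is attached, thanks'))

# No words in common: cosine similarity 0, so distance 1.
print(distance('please find the invoice attached', 'lunch on friday?'))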
Read more about this in this question. (It may even be a duplicate of yours.)
You should now get fairly good accuracy. Try it out. If it's still not as good as you want, then we have to go deeper... (get it? Because... deep learning). The first thing to know is that we need either a dataset to train on or an already trained model. That's required because neural networks have many parameters that MUST be adjusted in order to produce useful transformations.
What's been missing so far is UNDERSTANDING. We histogrammed the words, stripping them of any context or meaning. Instead, let's keep them where they are and try to recognize blocks of patterns. How can this be done?
- Embed the words into vectors of a fixed size, which deals with words having different lengths.
- Pad every sequence of word embeddings to a single length.
- Use convolutional networks to extract meaningful features from the sequences.
- Use fully-connected networks to project the extracted features onto a space that minimizes the distance between similar emails and maximizes the distance between dissimilar ones.
Let's use Keras to simplify our lives. It should look something like this:
# ... imports and params definitions

model = Sequential([
    Embedding(max_features,
              embedding_dims,
              input_length=maxlen),
    Dropout(0.2),
    Conv1D(filters=nb_filter,
           kernel_size=filter_length,
           padding='valid',
           activation='relu',
           strides=1),
    # Global max pooling collapses the whole sequence into one
    # feature vector per filter, regardless of the conv output length.
    GlobalMaxPooling1D(),
    Dense(256, activation='relu'),
])
# ... train the model or load pre-trained weights.

def distance(wordset1, wordset2):
    global model

    # X = ...  # Convert both emails to sequences of word indices.
    X = sequence.pad_sequences(X, maxlen=maxlen)
    y = model.predict(X)

    # Euclidean distance between the two projected emails.
    return np.linalg.norm(y[0] - y[1])
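One possible way to build the X = ... input above (a sketch, assuming a Keras Tokenizer fitted on a hypothetical training_emails list of strings):

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(training_emails)  # hypothetical training corpus

# Then, inside distance():
X = tokenizer.texts_to_sequences([wordset1, wordset2])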
There's a practical example of this kind of sentence processing that you can check in the Keras GitHub repo. Also, someone solved this exact same problem using a Siamese recurrent network in this Stack Overflow question.
Well, I hope this gives you some direction. :-)