You didn't mention the type of wordset1 and wordset2. I'll assume they are both strings.
You defined your distance as the word count and got a bad score. It's clear that text length is not a good dissimilarity measure: two emails of different sizes can talk about the same thing, while two emails of the same size can talk about completely different things.
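For reference, here is a minimal sketch of that kind of length-based metric (a guess, since you didn't post your code), together with an example of why it breaks down:

def distance(wordset1, wordset2):
    # Hypothetical reconstruction: difference in word counts.
    return abs(len(wordset1.split()) - len(wordset2.split()))

# Two emails about completely different topics, yet distance 0:
print(distance('cheap pills online now', 'meeting moved to friday'))  # 0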
So, as suggested above, you could try and check for SIMILAR WORDS instead:
import numpy as np

def distance(wordset1, wordset2):
    wordset1 = set(wordset1.split())
    wordset2 = set(wordset2.split())

    common_words = wordset1 & wordset2
    if common_words:
        return 1 / len(common_words)
    else:
        # They don't share any word. They are infinitely different.
        return np.inf
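With two made-up one-line emails, for instance:

print(distance('free money now', 'get free money today'))  # 0.5 (shares 'free' and 'money')
print(distance('meeting at noon', 'lunch tomorrow'))        # inf (no words in common)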
The problem with that is that two big emails are more likely to share words than two small ones, so this metric favors them, making them look "more similar to each other" than the small ones. How do we solve this? Well, we normalize the metric somehow:
import numpy as np

def distance(wordset1, wordset2):
    wordset1 = set(wordset1.split())
    wordset2 = set(wordset2.split())

    common_words = wordset1 & wordset2
    if common_words:
        # The distance, normalized by the total number of different
        # words in the two emails (the inverse of the Jaccard index).
        return len(wordset1 | wordset2) / len(common_words)
    else:
        # They don't share any word. They are infinitely different.
        return np.inf
This seems cool, but it completely ignores the FREQUENCY of the words. To account for that, we can use the bag-of-words model: build a vocabulary of all the words that appear and count how often each one occurs in each document. Let's use the CountVectorizer implementation from scikit-learn to make our job easier:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def distance(wordset1, wordset2):
    model = CountVectorizer()
    X = model.fit_transform([wordset1, wordset2]).toarray()

    # Euclidean distance between the two bags of words.
    return np.linalg.norm(X[0] - X[1])
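To see what the vectorizer actually builds, here's a quick check on a hypothetical two-email corpus (get_feature_names_out needs scikit-learn >= 1.0; older versions use get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

model = CountVectorizer()
X = model.fit_transform(['free money now', 'free money money money']).toarray()
print(model.get_feature_names_out())  # ['free' 'money' 'now']
print(X)                              # [[1 1 1]
                                      #  [1 3 0]]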
But now consider two pairs of emails. The emails in the first pair are composed of perfectly written English, full of "small" words (e.g. a, an, is, and, that) needed for them to be grammatically correct. The emails in the second pair are different: containing only the keywords, they are extremely dry. You see, chances are the first pair will look more similar than the second one. That happens because we currently weight all words the same, while we should be prioritizing the MEANINGFUL words in each text. To do that, let's use term frequency–inverse document frequency (TF-IDF). Luckily, there's a very similar implementation in scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer

def distance(wordset1, wordset2):
    model = TfidfVectorizer()
    X = model.fit_transform([wordset1, wordset2]).toarray()

    # TfidfVectorizer L2-normalizes each row, so the dot product
    # gives the cosine similarity between the two emails.
    similarity_matrix = X.dot(X.T)

    # The dissimilarity between samples wordset1 and wordset2.
    return 1 - similarity_matrix[0, 1]
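A quick usage check with two hypothetical emails (note that fitting the vectorizer on only two documents makes the IDF weighting rather weak; in practice you would fit it over your whole email corpus):

# Overlapping words ('the', 'invoice', 'attached'): distance below 1.
print(distance('please find the invoice attached', 'the invoice is attached, thanks'))

# No words in common: cosine similarity 0, so distance 1.
print(distance('please find the invoice attached', 'lunch on friday?'))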
Read more about this in this question. (It may even be a duplicate of yours.)
You should now get fairly good accuracy. Try it out. If it's still not as good as you want, then we have to go deeper... (get it? Because... deep learning). The first thing to know is that we need either a dataset to train on or an already trained model. That's required because neural networks have many parameters that MUST be adjusted in order to produce useful transformations.
What's been missing so far is UNDERSTANDING. We histogrammed the words, stripping them of any context or meaning. Instead, let's keep them where they are and try to recognize blocks of patterns. How can this be done?
- Embed the words into vectors of a fixed size, which deals with words having different lengths.
- Pad every sequence of word embeddings to a single length.
- Use convolutional networks to extract meaningful features from the sequences.
- Use fully-connected networks to project the extracted features onto a space that minimizes the distance between similar emails and maximizes the distance between dissimilar ones.
Let's use Keras to simplify our lives. It should look something like this:
# ... imports and params definitions

model = Sequential([
    Embedding(max_features,
              embedding_dims,
              input_length=maxlen),
    Dropout(0.2),
    Conv1D(filters=nb_filter,
           kernel_size=filter_length,
           padding='valid',
           activation='relu',
           strides=1),
    # Global max pooling collapses the whole sequence into one
    # feature vector per filter, regardless of the conv output length.
    GlobalMaxPooling1D(),
    Dense(256, activation='relu'),
])
# ... train the model or load pre-trained weights.

def distance(wordset1, wordset2):
    global model

    # X = ...  # Convert both emails to sequences of word indices.
    X = sequence.pad_sequences(X, maxlen=maxlen)
    y = model.predict(X)

    # Euclidean distance between the two projected emails.
    return np.linalg.norm(y[0] - y[1])
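One possible way to build the X = ... input above (a sketch, assuming a Keras Tokenizer fitted on a hypothetical training_emails list of strings):

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(training_emails)  # hypothetical training corpus

# Then, inside distance():
X = tokenizer.texts_to_sequences([wordset1, wordset2])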
There's a practical example of this kind of sentence processing that you can check in the Keras GitHub repo. Also, someone solved this exact same problem using a Siamese recurrent network in this Stack Overflow question.
Well, I hope this gives you some direction. :-)