
Given two documents, I want to calculate the similarity between them. I already have functions for cosine distance, n-grams and tf-idf, taken from a previously asked question.

I would like to know what further steps are needed to get a document similarity score out of these functions.

I have also tried Word2Vec, and then computed similarities with the following code:

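# Words: DataFrame with one document per row; column 'A' holds its list of words.
# model: trained Word2Vec model (newer gensim versions expose model.wv.similarity).
# Similarity: pre-allocated DataFrame that receives the pairwise scores.
for i in range(len(Words)):
    print(i)
    words_i = Words.iloc[i]['A']
    for k in range(len(Words)):
        words_k = Words.iloc[k]['A']
        net_sim = 0.0
        for w_i in words_i:
            # best match for w_i among the words of document k, floored at 0.0
            sim = 0.0
            for w_k in words_k:
                sim = max(sim, model.similarity(w_i, w_k))
            net_sim += sim
        Similarity.iloc[i, k] = net_sim / len(words_i)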

For every word in a given document, I find the most similar word in the second document and add up these similarities. Then I divide by the number of words in the first document to normalize the score to a range of 0 to 1. Here, Words is a DataFrame holding the words of each document in a separate row, and model is a Word2Vec model. This process takes a lot of time, so I want to optimize it and am looking for different approaches.
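The nested loops described above likely recompute the same word pairs many times, since words repeat within and across documents. Here is a minimal sketch of one way to cut that repeated work, assuming word similarity is symmetric: cache each unordered pair and drop duplicate words from the second document. The `VECTORS` table and `raw_similarity` are hypothetical stand-ins for the trained model and `model.similarity`; they are not part of the original code.

```python
from functools import lru_cache
from math import sqrt

# Hypothetical toy vectors standing in for a trained Word2Vec model.
VECTORS = {
    "cat": (1.0, 0.0),
    "dog": (0.8, 0.6),
    "car": (0.0, 1.0),
}

calls = 0  # counts how often the underlying similarity is really computed


def raw_similarity(a, b):
    """Cosine similarity of two toy vectors (stand-in for model.similarity)."""
    global calls
    calls += 1
    va, vb = VECTORS[a], VECTORS[b]
    dot = sum(x * y for x, y in zip(va, vb))
    return dot / (sqrt(sum(x * x for x in va)) * sqrt(sum(y * y for y in vb)))


@lru_cache(maxsize=None)
def _cached(pair):
    return raw_similarity(*pair)


def similarity(a, b):
    # Similarity is symmetric, so each unordered pair is stored only once.
    return _cached((a, b) if a <= b else (b, a))


def doc_similarity(doc_a, doc_b):
    """Average, over words in doc_a, of the best match in doc_b (floored at 0)."""
    targets = set(doc_b)  # duplicates in doc_b cannot change the maximum
    total = sum(max([0.0] + [similarity(w, t) for t in targets]) for w in doc_a)
    return total / len(doc_a)


docs = [["cat", "dog", "cat"], ["dog", "car"], ["car", "cat", "dog"]]
matrix = [[doc_similarity(a, b) for b in docs] for a in docs]
print(matrix)
print("similarity computations:", calls)
```

Note that this measure is not symmetric: `doc_similarity(a, b)` generally differs from `doc_similarity(b, a)`, because the average runs over the words of the first document only.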

Chinmay Joshi

1 Answer


If you are set on using the functions you describe, they should be easy to implement by reading the NLTK wiki, but I don't know whether that is the best way to measure the similarity between the documents.
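For the tf-idf and cosine measures mentioned in the question, one common way to combine them is to turn each document into a tf-idf weighted vector and take the cosine of the angle between two vectors. The sketch below uses only the standard library; the function names, the smoothed idf formula and the sample documents are illustrative assumptions, not the functions from the linked question.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a {term: tf-idf weight} dict."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf (an assumption) so terms present in every document
        # still keep a non-zero weight.
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "the cat sat on the mat".split(),
    "the cat lay on the rug".split(),
    "stock prices fell sharply today".split(),
]
vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # documents sharing several terms
sim_02 = cosine_similarity(vecs[0], vecs[2])  # documents sharing no terms
print(sim_01, sim_02)
```

Because tf-idf weights are never negative, the score stays between 0 and 1, and documents with no terms in common score exactly 0.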

As stated on the difflib docs page, you can also use that module to compare files and sequences:

This module provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs.

For comparing directories and files, see also, the filecmp module.


More specifically, you can use difflib.SequenceMatcher() to compare sequences of text.

Example:

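import difflib

# passing strings (str1 and str2 are example inputs)
str1 = "the quick brown fox"
str2 = "the quick brown dog"
matcher = difflib.SequenceMatcher(None, str1, str2)

# reading files (the file names are placeholders)
with open("doc1.txt") as file1, open("doc2.txt") as file2:
    file_matcher = difflib.SequenceMatcher(None, file1.read(), file2.read())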
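To get a single score between 0 and 1 out of a matcher, `SequenceMatcher.ratio()` can be used. The sketch below compares lists of words rather than raw characters; the helper name `sequence_similarity` and the whitespace tokenization are assumptions made for illustration. Keep in mind that this measures surface overlap of word runs, not semantic similarity.

```python
import difflib


def sequence_similarity(text_a, text_b):
    """Return a score in [0, 1] based on matching runs of words."""
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()
    # autojunk=False turns off the heuristic that ignores very frequent
    # items in long sequences, so common words still count as matches.
    matcher = difflib.SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    return matcher.ratio()


score = sequence_similarity("the quick brown fox", "the quick brown dog")
print(score)  # 3 matching words out of 8 in total: 2 * 3 / 8 = 0.75
```

The ratio is computed as twice the number of matched items divided by the combined length of both sequences, so identical texts give 1.0 and texts with nothing in common give 0.0.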

For more examples and tutorials, see:

PyMOTW - difflib

dot.Py
  • No, I need to find the similarity in range 0 to 1. I tried the Word2Vec approach but it takes a lot of time and I need a faster program for bigger data – Chinmay Joshi Jun 21 '16 at 08:46