I'm trying to replicate Cohen's recent working paper (Lazy Prices).
The main idea of the paper is that firms issuing financial disclosures with low similarity to the previous year's filing tend to show worse performance on average.
To measure similarity, he uses four similarity measures: Cosine, Jaccard, Sim_MinEdit, and Sim_Simple.
I believe the first two measures are widely used, so the methods for computing them are fairly well established.
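For context, a word-level implementation of those two can be as simple as the following (a minimal sketch assuming plain whitespace tokenization and raw term counts, which may not match the paper's exact preprocessing):

    from collections import Counter
    import math

    def jaccard_similarity(doc1, doc2):
        """Jaccard similarity on the sets of word tokens."""
        set1, set2 = set(doc1.split()), set(doc2.split())
        if not set1 or not set2:
            return 0.0
        return len(set1 & set2) / len(set1 | set2)

    def cosine_similarity(doc1, doc2):
        """Cosine similarity on raw term-frequency vectors."""
        tf1, tf2 = Counter(doc1.split()), Counter(doc2.split())
        dot = sum(tf1[w] * tf2[w] for w in tf1.keys() & tf2.keys())
        norm1 = math.sqrt(sum(c * c for c in tf1.values()))
        norm2 = math.sqrt(sum(c * c for c in tf2.values()))
        return dot / (norm1 * norm2) if norm1 and norm2 else 0.0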
However, the last two seem quite ambiguous.
For Sim_MinEdit, he explains that it is computed by counting the smallest number of operations required to transform one document into the other. (For example, going from 'We expect demand to increase' to 'We expect weakness in sales' requires deleting "demand", "to", and "increase", and adding "weakness", "in", and "sales".)
This looks very similar to an edit distance such as the Levenshtein distance. However, as far as I can tell, all the material about the Levenshtein distance I've found on the internet computes it at the character level.
My question is: is there an algorithm that computes this kind of word-level similarity using the basic principles of the Levenshtein distance?
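From what I understand, the standard Levenshtein dynamic-programming recurrence doesn't actually care whether the sequence elements are characters or words, so one option is simply to run it over token lists. A minimal sketch (assuming whitespace tokenization; the paper's exact preprocessing may differ):

    def word_level_levenshtein(doc1, doc2):
        """Minimum number of word insertions, deletions and substitutions
        needed to turn doc1 into doc2 (standard Levenshtein DP on tokens)."""
        a, b = doc1.split(), doc2.split()
        prev = list(range(len(b) + 1))          # distances from a[:0] to b[:j]
        for i, word_a in enumerate(a, start=1):
            curr = [i] + [0] * len(b)
            for j, word_b in enumerate(b, start=1):
                cost = 0 if word_a == word_b else 1
                curr[j] = min(prev[j] + 1,          # delete word_a
                              curr[j - 1] + 1,      # insert word_b
                              prev[j - 1] + cost)   # substitute (or keep)
            prev = curr
        return prev[len(b)]

On the example sentences above this returns 3 (three substitutions). The paper's example instead counts three deletions plus three additions, i.e. 6 operations, which suggests substitutions might not be allowed; if so, the substitution cost can be set to 2 instead of 1. Either way, the raw count still has to be normalised into a similarity score in whatever way the paper specifies.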
Secondly, Sim_Simple uses 'Track Changes' in Microsoft Word or the diff function in a Unix/Linux terminal. I found that difflib's SequenceMatcher in Python does the same job. However, since I'm trying to measure similarity at the word level, I'm using
SequenceMatcher(None, doc1.split(), doc2.split()).ratio()
instead of
SequenceMatcher(None, doc1, doc2).ratio()
where doc1 and doc2 are the document texts.
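As a sanity check, this is how the two calls behave on the example sentences from the paper (illustrative only; real filings obviously need cleaning and tokenization first):

    from difflib import SequenceMatcher

    doc1 = "We expect demand to increase"
    doc2 = "We expect weakness in sales"

    # Word level: only "We" and "expect" match, i.e. 2 of 5 tokens per document.
    print(SequenceMatcher(None, doc1.split(), doc2.split()).ratio())  # 0.4

    # Character level: also credits shared letters inside otherwise
    # different words, so it measures something rather different.
    print(SequenceMatcher(None, doc1, doc2).ratio())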
I know Stack Overflow isn't really the place for this kind of question, but since I haven't been able to find any relevant information on the web myself and have been stuck for a while, I'm looking for some help!