I'm trying to replicate Cohen's recent working paper (Lazy Prices).
The main idea of the paper is that firms issuing financial disclosures with low similarity to the previous year's filing tend to show worse performance on average.
To measure similarity, he uses four similarity measures: Cosine, Jaccard, Sim_MinEdit, and Sim_Simple.
I believe the first two measures are widely used, so the methods for computing them are fairly well established.
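For context, a word-level implementation of those two can be as simple as the following (a minimal sketch assuming plain whitespace tokenization and raw term counts, which may not match the paper's exact preprocessing):

    from collections import Counter
    import math

    def jaccard_similarity(doc1, doc2):
        """Jaccard similarity on the sets of word tokens."""
        set1, set2 = set(doc1.split()), set(doc2.split())
        if not set1 or not set2:
            return 0.0
        return len(set1 & set2) / len(set1 | set2)

    def cosine_similarity(doc1, doc2):
        """Cosine similarity on raw term-frequency vectors."""
        tf1, tf2 = Counter(doc1.split()), Counter(doc2.split())
        dot = sum(tf1[w] * tf2[w] for w in tf1.keys() & tf2.keys())
        norm1 = math.sqrt(sum(c * c for c in tf1.values()))
        norm2 = math.sqrt(sum(c * c for c in tf2.values()))
        return dot / (norm1 * norm2) if norm1 and norm2 else 0.0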
However, the last two seem quite ambiguous.
For Sim_MinEdit, he explains that it is computed by counting the smallest number of operations required to transform one document into the other. (For example, going from 'We expect demand to increase' to 'We expect weakness in sales' requires deleting "demand", "to", and "increase", and adding "weakness", "in", and "sales".)
This looks very similar to an edit distance such as the Levenshtein distance. However, as far as I can tell, all the material about the Levenshtein distance I've found on the internet computes it at the character level.
My question is: is there an algorithm that computes this kind of word-level similarity using the basic principles of the Levenshtein distance?
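From what I understand, the standard Levenshtein dynamic-programming recurrence doesn't actually care whether the sequence elements are characters or words, so one option is simply to run it over token lists. A minimal sketch (assuming whitespace tokenization; the paper's exact preprocessing may differ):

    def word_level_levenshtein(doc1, doc2):
        """Minimum number of word insertions, deletions and substitutions
        needed to turn doc1 into doc2 (standard Levenshtein DP on tokens)."""
        a, b = doc1.split(), doc2.split()
        prev = list(range(len(b) + 1))          # distances from a[:0] to b[:j]
        for i, word_a in enumerate(a, start=1):
            curr = [i] + [0] * len(b)
            for j, word_b in enumerate(b, start=1):
                cost = 0 if word_a == word_b else 1
                curr[j] = min(prev[j] + 1,          # delete word_a
                              curr[j - 1] + 1,      # insert word_b
                              prev[j - 1] + cost)   # substitute (or keep)
            prev = curr
        return prev[len(b)]

On the example sentences above this returns 3 (three substitutions). The paper's example instead counts three deletions plus three additions, i.e. 6 operations, which suggests substitutions might not be allowed; if so, the substitution cost can be set to 2 instead of 1. Either way, the raw count still has to be normalised into a similarity score in whatever way the paper specifies.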
Secondly, Sim_Simple uses 'Track Changes' in Microsoft Word or the diff function in a Unix/Linux terminal. I found that difflib's SequenceMatcher in Python does the same job. However, since I'm trying to measure similarity at the word level, I'm using
SequenceMatcher(None, doc1.split(), doc2.split()).ratio()
instead of
SequenceMatcher(None, doc1, doc2).ratio()
where doc1 and doc2 are the document texts.
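As a sanity check, this is how the two calls behave on the example sentences from the paper (illustrative only; real filings obviously need cleaning and tokenization first):

    from difflib import SequenceMatcher

    doc1 = "We expect demand to increase"
    doc2 = "We expect weakness in sales"

    # Word level: only "We" and "expect" match, i.e. 2 of 5 tokens per document.
    print(SequenceMatcher(None, doc1.split(), doc2.split()).ratio())  # 0.4

    # Character level: also credits shared letters inside otherwise
    # different words, so it measures something rather different.
    print(SequenceMatcher(None, doc1, doc2).ratio())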
I know Stack Overflow isn't really the place for this kind of question, but since I haven't been able to find any relevant information on the web myself and have been stuck for a while, I'm looking for some help!