comparing "the likes" smartly

Question

Suppose you need to compare two files, but only when the comparison makes sense. In other words, you wouldn't want to compare a JSON file with a properties file, or a .txt file with a .jar file.

Additionally, suppose you have a mechanism in place to sort all of that out, and what it comes down to now is the actual file name. You would want to compare "myFile.txt" with "myFile.txt", but not with "somethingElse.txt". The goal is to stay as close to "apples to apples" as possible.

So here we are: on one side you have "myFile.txt", and on the other side you have "_myFile.txt", "_m_y_f_i_l_e.txt" and "somethingReallyClever.txt".

The task is to pick the closest name for the later comparison. Unfortunately, no identical name is found.

Looking at the character composition, it is not hard to figure out what the relationship is. My algorithm says:

_myFile.txt   to   _m_y_f_i_l_e.txt                  0.312
_myFile.txt   to   somethingReallyClever.txt         0.16

So _m_y_f_i_l_e.txt is closer to _myFile.txt than somethingReallyClever.txt is. Fantastic. But it also says that it is only 2 times closer, whereas in reality anyone looking at the two files would never think to compare somethingReallyClever.txt with _myFile.txt.

Why?

What logic would you suggest I apply, not only to estimate the likelihood from characters being in the same place, but also to test whether the resulting weight makes sense?

In my example, somethingReallyClever.txt should have had a weight of 0.0.

I hope I am being clear.

Please share your experience and thoughts on this. (Whatever approach you suggest should not depend on the number of characters in the file name.)

possible duplicate of [Word comparison algorithm](http://stackoverflow.com/questions/473522/word-comparison-algorithm) — MartinodF, Nov 10 '10 at 01:09

score 2 · Accepted Answer · edited May 23 '17 at 12:19

A possibly helpful previous question, which highlights several possible algorithms:

[Word comparison algorithm](http://stackoverflow.com/questions/473522/word-comparison-algorithm)

These algorithms are based on how many changes would be needed to get from one string to the other - where a change is adding a character, deleting a character, or replacing a character.
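As a rough illustration of that idea, here is a minimal sketch of the classic Levenshtein edit distance in Python. The function name `levenshtein` is just an illustrative choice, and the comparison is case-sensitive; it is not taken from the linked question.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or replacements needed to turn a into b (case-sensitive)."""
    # previous[j] = distance between the prefix of a processed so far
    # (initially empty) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i characters of a into "" costs i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # replace ca by cb (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("myFile.txt", "_myFile.txt"))       # 1 (one inserted underscore)
print(levenshtein("myFile.txt", "_m_y_f_i_l_e.txt"))  # 7 (six underscores plus F -> f)
```

The raw count grows with the length of the names, so it usually needs to be scaled (for example by the longer name's length) before two scores can be compared fairly.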

Certainly any sensible metric here should treat a low score as meaning close (think of it as the distance between the two strings) and a larger score as meaning not so close.

score 0 · Answer 2 · answered Nov 10 '10 at 01:09

Sounds like you want the Levenshtein distance, perhaps modified by first converting both words to the same case and normalizing separators (e.g. replacing all spaces and underscores with the empty string).
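A minimal sketch of that suggestion in Python follows. The helper names (`normalize`, `closest`) and the `max_ratio=0.5` cutoff are assumptions made for illustration, not part of the answer; the cutoff is one possible way to reject names that are simply too far away instead of giving them a small non-zero weight.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions and replacements each cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def normalize(name: str) -> str:
    """Lowercase the name and drop whitespace and underscores."""
    return re.sub(r"[\s_]+", "", name).lower()


def closest(target: str, candidates, max_ratio: float = 0.5):
    """Return the candidate nearest to target, or None if none is close enough.

    The distance between normalized names is divided by the longer length,
    so the score does not depend on how long the names are.
    max_ratio is a hypothetical cutoff; tune it for your data.
    """
    best = None
    for cand in candidates:
        a, b = normalize(target), normalize(cand)
        ratio = levenshtein(a, b) / (max(len(a), len(b)) or 1)
        if ratio > max_ratio:
            continue  # too different to be worth comparing at all
        # Break ties between equally close normalized names with the raw distance.
        key = (ratio, levenshtein(target, cand))
        if best is None or key < best[0]:
            best = (key, cand)
    return best[1] if best else None


names = ["_myFile.txt", "_m_y_f_i_l_e.txt", "somethingReallyClever.txt"]
print(closest("myFile.txt", names))                          # _myFile.txt
print(closest("myFile.txt", ["somethingReallyClever.txt"]))  # None
```

After normalization both underscore variants collapse to "myfile.txt", so they tie at ratio 0 and the raw distance decides between them, while "somethingReallyClever.txt" falls above the cutoff and is dropped.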

I82Much