Suppose you need to perform some kind of comparison amongst 2 files. You only need to do it when it makes sense, in other words, you wouldn't want to compare JSON file with Property file or .txt file with .jar file
Additionally suppose that you have a mechanism in place to sort all of these things out and what it comes down to now is the actual file name. You would want to compare "myFile.txt" with "myFile.txt", but not with "somethingElse.txt". The goal is to be as close to "apples to apples" rules as possible.
So here we are, on one side you have "myFile.txt" and on another side you have "_myFile.txt", "_m_y_f_i_l_e.txt" and "somethingReallyClever.txt".
Task is to pick the closest name to later compare. Unfortunately, identical name is not found.
Looking at the character composition, it is not hard to figure out what the relationship is. My algo says:
_myFile.txt to _m_y_f_i_l_e.txt 0.312 _myFile.txt to somethingReallyClever.txt 0.16
So _m_y_f_i_l_e.txt is closer to_myFile.txt then somethingReallyClever.txt. Fantastic. But also says that ist is only 2 times closer, where as in reality we can look at the 2 files and would never think to compare somethingReallyClever.txt with _myFile.txt.
Why?
What logic would you suggest i apply to not only figure out likelihood by having chars on the same place, but also test whether determined weight makes sense?
In my example, somethingReallyClever.txt should have had a weight of 0.0
I hope i am being clear.
Please share your experience and thoughts on this. (whatever approach you suggest should not depend on number of characters filename consists out of)