What is the best algorithm to match or compute the distance between two strings in C# when the order or number of times a word appears is not important?

Best means:

  • Would mostly agree with a human match
  • Elegant
  • Efficient
  • Scalable, so that an input string could be matched to a potentially large collection of other strings

Related questions:

Some notes:

  • Because of the order and occurrence independence, the inputs can be thought of as sets of unique words, not strings in the sense of arrays of characters
  • Not specifically looking for a database solution, although one would be interesting
  • I'm way too old for this to be a homework problem ;)
Thomas Bratt

2 Answers

Search for a method called "Double Metaphone", which I believe is the best available for word-by-word comparison. It accounts for different languages as well, which is quite amazing.

If you are comparing whole strings, you could perhaps use this along with cosine similarity, which should yield very good results.

Marwan

This looks like a canonical case for applying standard information retrieval algorithms. Cosine distance is what first comes to mind, but there might be better matches for your particular case. This is a good link to start digging down that route:

http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html

Implementation example:

How do I calculate the cosine similarity of two vectors?
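
As a rough illustration of the idea for the case in the question, where word order and repetition do not matter, here is a minimal sketch that treats each string as a set of unique words. With sets, every word has weight 0 or 1, so cosine similarity reduces to the number of shared words divided by the square root of the product of the two set sizes. The class name `WordSetSimilarity`, the helper `Tokenize`, and the choice to lower-case and split on `\w+` are assumptions made for this example, not part of any library or of the linked tutorial.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, named for illustration only.
public static class WordSetSimilarity
{
    // Assumption: words are runs of letters, digits, or underscores,
    // compared case-insensitively. Duplicates collapse in the set.
    public static HashSet<string> Tokenize(string text)
    {
        return new HashSet<string>(
            Regex.Matches(text.ToLowerInvariant(), @"\w+")
                 .Cast<Match>()
                 .Select(m => m.Value));
    }

    // Cosine similarity of two binary word vectors:
    // |A ∩ B| / sqrt(|A| * |B|). Returns 0 if either set is empty.
    public static double Cosine(HashSet<string> a, HashSet<string> b)
    {
        if (a.Count == 0 || b.Count == 0)
        {
            return 0.0;
        }

        int shared = a.Count(word => b.Contains(word));
        return shared / Math.Sqrt((double)a.Count * b.Count);
    }
}
```

For example, `Cosine(Tokenize("the quick brown fox"), Tokenize("Fox brown THE the quick"))` gives 1.0, because both inputs reduce to the same four words, while "quick brown fox" against "lazy brown dog" shares one word out of three on each side and gives about 0.33. Weighting schemes such as TF-IDF may agree better with human judgment on real data; this sketch deliberately ignores them.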

Vinko Vrsalovic