Another option is the library https://github.com/maxbachmann/rapidfuzz-cpp (I am the author) which includes a similarity algorithm called partial_ratio
.
It performs the following steps:
- searches for the longest common subsequences of the two strings
- Iterates over those subsequences and calculates a normalized Levenshtein distance between the shorter string and a substring of the longer string, with len(shorter string), that starts at the start of the subsequence
- returns score of the optimal alignment
As an example:
#include "rapidfuzz/fuzz.hpp"
#include <string>
std::string a = "txst";
std::string b = "this is a test";
double score = rapidfuzz::fuzz::partial_ratio(a, b);
// score is 75.0
In this example the similarity of the optimal alignment "txst" <-> "test" is calculated. Since only one substitution is required the similarity is 75%.
Getting edit distance instead of relative similarity
As you noted in the comments you are interested in the edit distance instead of relative similarity.
Here is a reimplementation of partial_ratio, that does this:
template <typename Sentence1, typename Sentence2>
std::size_t partial_ratio(const Sentence1& s1, const Sentence2& s2, std::size_t max=-1)
{
auto s1_view = rapidfuzz::common::to_string_view(s1);
auto s2_view = rapidfuzz::common::to_string_view(s2);
if (s1_view.empty() && s2_view.empty()) {
return 0;
}
if (s1_view.empty() || s2_view.empty()) {
return -1;
}
// this swap is performed in the original FuzzyWuzzy implementation, so
// I use it in my implementation as well. Depending on your goals you
// might want to remove this swap
if (s1_view.length() > s2_view.length()) {
return partial_ratio(s2_view, s1_view, max);
}
// Note this is a internal API and so it could change at any time
// currently this is the slowest part, since my bitparallel
// implementation of the LCS has a bug and so it is disabled for now
auto blocks = rapidfuzz::detail::get_matching_blocks(s1_view, s2_view);
// when there is a full match exit early
for (const auto& block : blocks) {
if (block.length == s1_view.length()) {
return 0;
}
}
// find the edit distance at the places with the longest common subsequence
std::size_t min_dist = (std::size_t)-1;
for (const auto& block : blocks) {
std::size_t long_start = (block.dpos > block.spos) ? block.dpos - block.spos : 0;
auto long_substr = s2_view.substr(long_start, s1_view.length());
// Note you could use a different weight here like
// e.g. {1, 1, 1} for the normal Levenshtein distance
// only {1, 1, 1} and {1, 1, 2} have very fast implementations
std::size_t dist = rapidfuzz::string_metric::levenshtein(
s1_view, long_substr, {1, 1, 2}, max);
if (dist < min_dist) {
max = min_dist = dist;
}
}
return min_dist;
}
s1
and s2
can be any type that can be converted to a rapidfuzz::basic_string_view. Some examples are:
- std::basic_string (std::string, ...)
- std::basic_string_view (std::string_string, ...) since C++17
- c strings like char* (this has to find the length once when creating the string_view)
- many other type, that have
data()
and .size()
and use continous memory. Examples are boost::string_view or std::vector
For results with a edit distance above max (size_t)-1) i
is returned instead.