I am working on an implementation of some string distance algorithms using Rcpp, and I am running into an issue. The normal levenshtein distance works fine, but I also want a weighted levenshtein distance with custom substitution costs. These costs are pre-calculated and build a distance matrix:
a b c
a 0 1 2
b 1 0 1.2
c 2 1.2 0
Now, the question is how to access theses substitution costs during the calculation of the levenshtein distance.
What has worked for me for now is to have two structures:
const std::vector<std::vector<double>> M
with the actual costs and
std::unordered_map<std::string, int> phon_map
which basically contains indices for the row/column of each character
and then do:
cost = M[phon_map[str1[i-1]]][phon_map[str2[j-1]]];
But this is not very fast (considerably slower than the regular levenshtein distance).
Other things I tried:
- A map with a pairstd::string,std::string
- A double map: std::unordered_map<std::string, std::unordered_map<std::string, double> >
- A map: std::unordered_map<std::string, double>, but pasting the two strings with some separator: phon_map[str1[i-1] + "__" str2[j-1]]
These are all slower than the current implementation.
Edit:
I tried using a std::vector<double>
for M, and it does seem to speed up things a bit, but I think there is still some slowness in accessing phon_map.