
I am working on an implementation of some string distance algorithms using Rcpp, and I am running into an issue. The normal Levenshtein distance works fine, but I also want a weighted Levenshtein distance with custom substitution costs. These costs are pre-calculated and form a distance matrix:

      a    b    c
  a   0    1    2
  b   1    0    1.2
  c   2    1.2  0

Now, the question is how to access these substitution costs during the calculation of the Levenshtein distance.

What has worked for me for now is to have two structures:

const std::vector<std::vector<double>> M

with the actual costs and

std::unordered_map<std::string, int> phon_map

which basically contains indices for the row/column of each character
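For the small example matrix above, those two structures would look roughly like this (values copied from the table; this is only an illustration, not the actual code):

    // illustration only: the example matrix from the question encoded in the
    // two structures described here (row/column order a, b, c)
    #include <string>
    #include <unordered_map>
    #include <vector>

    std::unordered_map<std::string, int> phon_map{{"a", 0}, {"b", 1}, {"c", 2}};

    const std::vector<std::vector<double>> M{
        {0.0, 1.0, 2.0},  // a
        {1.0, 0.0, 1.2},  // b
        {2.0, 1.2, 0.0}   // c
    };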

and then do:

cost = M[phon_map[str1[i-1]]][phon_map[str2[j-1]]];

But this is not very fast (considerably slower than the regular Levenshtein distance).
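For context, here is a minimal sketch of how such a lookup typically sits inside the weighted Levenshtein DP loop (my reconstruction, not the original code; it assumes the inputs are std::vector<std::string> of phoneme symbols and that insertions and deletions cost 1):

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <vector>

    double weighted_lev(const std::vector<std::string>& str1,
                        const std::vector<std::string>& str2,
                        const std::vector<std::vector<double>>& M,
                        std::unordered_map<std::string, int>& phon_map) {
        const std::size_t n = str1.size(), m = str2.size();
        std::vector<std::vector<double>> d(n + 1, std::vector<double>(m + 1, 0.0));
        for (std::size_t i = 1; i <= n; ++i) d[i][0] = static_cast<double>(i);
        for (std::size_t j = 1; j <= m; ++j) d[0][j] = static_cast<double>(j);

        for (std::size_t i = 1; i <= n; ++i) {
            for (std::size_t j = 1; j <= m; ++j) {
                // the lookup in question: two hash lookups plus two pointer
                // chases into the nested vectors, per DP cell
                double cost = M[phon_map[str1[i - 1]]][phon_map[str2[j - 1]]];
                d[i][j] = std::min({d[i - 1][j] + 1.0,        // deletion
                                    d[i][j - 1] + 1.0,        // insertion
                                    d[i - 1][j - 1] + cost}); // substitution
            }
        }
        return d[n][m];
    }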

Other things I tried:

  • A map keyed on a std::pair<std::string, std::string>
  • A double map: std::unordered_map<std::string, std::unordered_map<std::string, double> >
  • A map: std::unordered_map<std::string, double>, keyed on the two strings concatenated with a separator: phon_map[str1[i-1] + "__" + str2[j-1]]

These are all slower than the current implementation.

Edit:

I tried using a flat std::vector<double> for M, and it does seem to speed things up a bit, but I think there is still some slowness in the phon_map lookups.
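A sketch of where that can be taken further (my suggestion, not code from the question): with M stored as one row-major vector, the phon_map lookups can also be resolved once per symbol before the DP loop, so the hot loop touches only integers and contiguous doubles. The function name and signature below are made up for illustration:

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <vector>

    double weighted_lev_flat(const std::vector<std::string>& str1,
                             const std::vector<std::string>& str2,
                             const std::vector<double>& M_flat,  // n_phon * n_phon, row-major
                             std::size_t n_phon,
                             std::unordered_map<std::string, int>& phon_map) {
        const std::size_t n = str1.size(), m = str2.size();

        // resolve the hash lookups once per symbol instead of twice per DP cell
        std::vector<std::size_t> idx1(n), idx2(m);
        for (std::size_t i = 0; i < n; ++i) idx1[i] = phon_map[str1[i]];
        for (std::size_t j = 0; j < m; ++j) idx2[j] = phon_map[str2[j]];

        // the DP table as a single flat, row-major buffer as well
        std::vector<double> d((n + 1) * (m + 1), 0.0);
        for (std::size_t i = 1; i <= n; ++i) d[i * (m + 1)] = static_cast<double>(i);
        for (std::size_t j = 1; j <= m; ++j) d[j] = static_cast<double>(j);

        for (std::size_t i = 1; i <= n; ++i) {
            for (std::size_t j = 1; j <= m; ++j) {
                // pure integer arithmetic into one contiguous block
                double cost = M_flat[idx1[i - 1] * n_phon + idx2[j - 1]];
                d[i * (m + 1) + j] =
                    std::min({d[(i - 1) * (m + 1) + j] + 1.0,        // deletion
                              d[i * (m + 1) + j - 1] + 1.0,          // insertion
                              d[(i - 1) * (m + 1) + j - 1] + cost}); // substitution
            }
        }
        return d[n * (m + 1) + m];
    }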

  • Fastest algorithmically or on some particular CPU? It's worth noting that `std::vector` (non-nested) almost always wins. Anything with dynamic allocation is absolutely punishing since those are not always contiguous and that leads to a lot of cache misses. – tadman Sep 09 '22 at 22:30
  • Unless you need a non-rectangular array, [something like this usually works well](https://stackoverflow.com/a/36123944/4581301) – user4581301 Sep 09 '22 at 22:51
  • why would a class make things simpler here? or is the suggestion to use an array instead of vector? – Matías Guzmán Naranjo Sep 09 '22 at 22:53
  • A `vector` is basically a pointer to a dynamic allocation. A `vector` of `vector`s is a pointer to a dynamically allocated array of `vector`s, which in turn point to many dynamically allocated arrays of data. None of these arrays are necessarily anywhere near each other in memory, so when the CPU starts loading up the cache, it can't just read in a nice straight line. It gets the data for the outer `vector`, then it can use that array to find and read one of the inner `vector`s, and then it has to find and read another `vector`, and another, and another... This can get staggeringly slow. – user4581301 Sep 09 '22 at 23:09
  • The little hack in Doug's answer linked above uses one `vector` and presents it as though it were 2D (a rough sketch of such a wrapper follows after these comments). Now the CPU gets a nice predictable path when it loads up the cache. Maybe it even grabs the whole array in one read, and if it doesn't, it doesn't even have to think about where the next place to start reading is. It just keeps on going. – user4581301 Sep 09 '22 at 23:12
  • Stroustrup, the God Emperor of C++, suggests that you A) Don’t store data unnecessarily, B) keep data compact, and C) access memory in a predictable manner if you want fast programs. Not much you can do about A here, but using a 1D array and pretending it is 2D grants you B and C. `vector` drops the ball on both, but not as badly on C as it does on B. – user4581301 Sep 09 '22 at 23:18
  • Does it matter that I am not accessing M linearly at all? – Matías Guzmán Naranjo Sep 09 '22 at 23:20
  • @MatíasGuzmánNaranjo what matters more is if you are accessing it _predictably_, as was said above. Linear is predictable, but so is every odd one. – Taekahn Sep 09 '22 at 23:37
  • What happened to using a `std::map`? This is known in DB circles as an index table. – Thomas Matthews Sep 09 '22 at 23:56
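For reference, a minimal sketch of the kind of 1D-backed 2D wrapper class discussed in the comments above (my own sketch, not necessarily identical to the linked answer):

    #include <cstddef>
    #include <vector>

    // one contiguous allocation presented through a 2D interface
    class Matrix2D {
    public:
        Matrix2D(std::size_t rows, std::size_t cols)
            : cols_(cols), data_(rows * cols, 0.0) {}

        double& operator()(std::size_t r, std::size_t c)       { return data_[r * cols_ + c]; }
        double  operator()(std::size_t r, std::size_t c) const { return data_[r * cols_ + c]; }

    private:
        std::size_t cols_;
        std::vector<double> data_;
    };

With something like this, the lookup from the question becomes M(phon_map[str1[i - 1]], phon_map[str2[j - 1]]), and all the costs live in one cache-friendly block, which is what the comments about predictable access are getting at.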
