
I have a large sequence of strings (~20,000) and a custom string similarity function, and I want to find which strings are similar to which. One way to do this is to compute a full similarity matrix with Python and `pdist` from scipy, but this is not feasible because the resulting matrix would be too large. Can I have some advice on how to approach this?
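For reference, a minimal sketch of that `pdist` route, with a hypothetical stand-in for the custom similarity function (`pdist` simply evaluates the callable on every pair), and why it does not scale:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def similarity(a: str, b: str) -> float:
    """Hypothetical stand-in: Jaccard similarity of character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

strings = ["apple", "apply", "banana"]  # imagine ~20,000 of these

# pdist needs numeric input, so pass row indices and look the strings up.
idx = np.arange(len(strings)).reshape(-1, 1)
condensed = pdist(idx, lambda u, v: similarity(strings[int(u[0])], strings[int(v[0])]))
matrix = squareform(condensed)

# For n = 20,000 the condensed vector alone holds n*(n-1)/2 ≈ 2.0e8 floats
# (~1.6 GB as float64), and every entry costs one Python-level call.
```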

endeavor
  • You can short-circuit things a bit by using the triangle inequality. If `a` is close to `b` and `b` is far from `c`, then you don't need to compare `a` to `c` to conclude that `a` is far from `c`. – Stef Feb 20 '23 at 18:37
  • I have thought about that, but one of the problems is the runtime. Implementing it naively in Python would be a bit slow, in my opinion. – endeavor Feb 20 '23 at 18:40
  • Perhaps group the words into batches. Say you have 200 batches of 100 strings per batch. Then for each batch you compute the complete matrix for that batch. Instead of one 20000x20000 matrix, you have 200 matrices of 100x100. Then you use the information from these matrices along with the triangle inequality to build "clusters" of words (see the pruning sketch after these comments). – Stef Feb 20 '23 at 18:45
  • Also, there are often clever ways to use numpy functions to avoid Python loops, so you might not have to do it "naively in Python". – Stef Feb 20 '23 at 18:46
  • That's a good idea, I will try it. Also, is there any available library that does this? I don't think this is a new problem. – endeavor Feb 20 '23 at 18:52
  • Take a look at [similar question here](https://stackoverflow.com/questions/67548039/nearest-neighbor-search-over-levenshtein-distance-in-python-using-metric-indexin). Sadly no answers, but the question itself contains some useful information. If you could write your distance metric to work on strings converted to numeric values, you could use that approach (a metric-index sketch along similar lines follows these comments). – dankal444 Feb 20 '23 at 19:16
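A minimal sketch of the triangle-inequality pruning suggested in the comments, in its simplest single-pivot form. It assumes the similarity can be expressed as a true distance `dist` satisfying the triangle inequality; `dist` and `threshold` are placeholders:

```python
from typing import Callable, List, Tuple

def close_pairs(
    strings: List[str],
    dist: Callable[[str, str], float],  # must obey the triangle inequality
    threshold: float,
) -> List[Tuple[int, int, float]]:
    """Return all index pairs (i, j) with dist(strings[i], strings[j]) <= threshold.

    Single-pivot pruning: the triangle inequality gives
    dist(a, b) >= |dist(p, a) - dist(p, b)| for any pivot p, so pairs
    whose pivot distances differ by more than `threshold` are skipped
    without ever calling `dist` on them.
    """
    pivot = strings[0]
    d_pivot = [dist(pivot, s) for s in strings]  # n distance calls up front

    pairs = []
    n = len(strings)
    for i in range(n):
        for j in range(i + 1, n):
            if abs(d_pivot[i] - d_pivot[j]) > threshold:
                continue  # lower bound already exceeds the threshold
            d = dist(strings[i], strings[j])
            if d <= threshold:
                pairs.append((i, j, d))
    return pairs
```

Sorting the strings by `d_pivot` would turn the skip into an early break of the inner loop, and using one pivot per batch (as suggested above) tightens the bounds further. Note the pruning is only sound for a proper distance metric, not for a raw similarity score.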
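Along the lines of the metric indexing mentioned in the last comment, a BK-tree is one standard index for discrete metrics such as Levenshtein distance. A minimal sketch (this is a generic BK-tree, not the approach from the linked question; the radius is use-case dependent):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

class BKTree:
    """Metric index for discrete distances obeying the triangle inequality."""

    def __init__(self, dist):
        self.dist = dist
        self.root = None  # node = (word, {edge_distance: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.dist(word, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def query(self, word, radius):
        """All stored words within `radius` of `word`."""
        results, stack = [], [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = self.dist(word, node[0])
            if d <= radius:
                results.append((node[0], d))
            # Triangle inequality: only children whose edge label lies
            # in [d - radius, d + radius] can contain matches.
            for edge, child in node[1].items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return results

tree = BKTree(levenshtein)
for s in ["apple", "apply", "ample", "maple", "banana"]:
    tree.add(s)
print(tree.query("apple", radius=1))  # words within edit distance 1 of "apple"
```

Building the tree costs one insertion per string; querying each of the ~20,000 strings then touches only the subtrees the triangle inequality cannot rule out, instead of all pairs.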

0 Answers