I have a fairly large list (~200,000 strings) from which I need to find strings that are similar according to a fuzzy edit metric, so that e.g. ['apple', 'apples'] and ['boy', 'boys', 'bay', 'bays'] are grouped together. I initially used a pandas DataFrame with threading, but a NumPy vectorized approach turned out to be faster. However, it still takes more than a day to run. What would be the fastest way to do this (any approach would work)?
import numpy as np
from fuzzywuzzy import fuzz  # or: from rapidfuzz import fuzz
from tqdm import tqdm

base_array = np.array([...])
y = len(base_array)
threshold = 90  # similarity cutoff (placeholder value): minimum fuzz.ratio to count as similar

def fuzzy_ratio(string_A, string_B):
    return fuzz.ratio(string_A, string_B)

vfunc = np.vectorize(fuzzy_ratio)

for b in tqdm(base_array):
    B = np.tile(b, y)                  # repeat the current string to match base_array's length
    Func_array = vfunc(base_array, B)  # pairwise ratios of b against every string
    similar_entities = np.argwhere(Func_array >= threshold)
    # extract similar entities for each entity in the base array
The key is to make this work on large arrays as fast as possible; any suggestion would be welcome.
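For reference, one direction I have not benchmarked yet is rapidfuzz's process.cdist, which computes the pairwise score matrix in C++ across multiple cores instead of calling fuzz.ratio from Python. Below is a minimal sketch, assuming rapidfuzz is installed; CUTOFF and CHUNK are placeholder values I picked for illustration, and base_array is the same array as above.

import numpy as np
from rapidfuzz import fuzz, process

CUTOFF = 90    # placeholder: minimum ratio to treat two strings as similar
CHUNK = 2000   # placeholder: rows per block so the score matrix fits in memory

matches = {}
for start in range(0, len(base_array), CHUNK):
    block = base_array[start:start + CHUNK]
    # pairwise ratios of the block against the whole array, computed in C++ on all cores;
    # scores below CUTOFF are zeroed out via score_cutoff
    scores = process.cdist(block, base_array, scorer=fuzz.ratio,
                           score_cutoff=CUTOFF, dtype=np.uint8, workers=-1)
    rows, cols = np.nonzero(scores)
    for r, c in zip(rows, cols):  # note: each string also matches itself
        matches.setdefault(base_array[start + r], []).append(base_array[c])

Even so, this is still an all-pairs comparison (roughly 4e10 ratio computations for 200,000 strings), so I suspect the real win would come from avoiding most comparisons altogether, e.g. by some form of blocking or clustering, if anyone has experience with that.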