
I have a fairly large list (~200,000 strings) from which I need to find strings that are similar as defined by a fuzzy edit metric, so that ['apple', 'apples'] and ['boy', 'boys', 'bay', 'bays'] are each grouped together. I initially used a pandas DataFrame with threading, but I found NumPy vectorization to be faster. However, it is still taking more than a day. What would be the fastest way to do this (any approach would work)?


import numpy as np
from tqdm import tqdm
from fuzzywuzzy import fuzz  # or: from rapidfuzz import fuzz

base_array = np.array([...])  # ~200,000 strings
y = len(base_array)

def fuzzy_ratio(string_A, string_B):
    # fuzz.ratio returns a similarity score in [0, 100]
    return fuzz.ratio(string_A, string_B)

vfunc = np.vectorize(fuzzy_ratio)

similar_entities = []
for b in tqdm(base_array):
    B = np.tile(b, y)
    Func_array = vfunc(base_array, B)
    # keep the indices whose score clears a similarity cutoff (90 is a placeholder value)
    similar_entities.append(np.argwhere(Func_array >= 90))

# similar_entities[i] now holds the indices similar to base_array[i]
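For reference, here is a batched sketch of the same all-pairs scoring using rapidfuzz's process.cdist (assuming rapidfuzz is installed; the chunk size and the 90 cutoff are placeholder values I would still need to tune):

import numpy as np
from rapidfuzz import fuzz, process

chunk = 1000  # rows scored per batch, so the score matrix stays small in memory
similar = {}
for start in range(0, len(base_array), chunk):  # base_array as above
    block = base_array[start:start + chunk]
    # cdist computes every pairwise score in C using all cores; scores below
    # score_cutoff come back as 0, so the nonzero entries are the "similar" ones
    scores = process.cdist(block, base_array, scorer=fuzz.ratio,
                           score_cutoff=90, workers=-1, dtype=np.uint8)
    rows, cols = np.nonzero(scores)
    for r, c in zip(rows, cols):
        similar.setdefault(base_array[start + r], []).append(base_array[c])

Note that each string also matches itself (score 100), so it appears in its own list.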

The key is to make this work as fast as possible for large arrays. Any suggestion would be welcome.
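For the grouping step, here is a minimal sketch, assuming SciPy is available: treat each above-cutoff pair as a graph edge and take connected components, so that transitive matches fall into one group (the 60 cutoff and the toy data are placeholders):

import numpy as np
from rapidfuzz import fuzz, process
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

toy_array = np.array(['apple', 'apples', 'boy', 'boys', 'bay', 'bays'])

scores = process.cdist(toy_array, toy_array, scorer=fuzz.ratio,
                       score_cutoff=60, workers=-1, dtype=np.uint8)
graph = csr_matrix(scores > 0)                 # adjacency matrix of similar pairs
n_groups, labels = connected_components(graph, directed=False)

groups = {}
for label, string in zip(labels, toy_array):
    groups.setdefault(label, []).append(string)
print(list(groups.values()))
# e.g. [['apple', 'apples'], ['boy', 'boys', 'bay', 'bays']]

This links 'boy' and 'bays' through the intermediate strings even though their direct score is below the cutoff.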

vineeth venugopal
