I have a fairly large list (~200,000 strings) from which I need to find strings that are similar according to a fuzzy edit metric, so that e.g. ['apple', 'apples'] and ['boy', 'boys', 'bay', 'bays'] are grouped together. I initially used a pandas DataFrame with threading, but a NumPy vectorized approach turned out to be faster. However, it still takes more than a day to run. What would be the fastest way to do this (any approach would work)?
import numpy as np
from fuzzywuzzy import fuzz  # or: from rapidfuzz import fuzz
from tqdm import tqdm

base_array = np.array([...])
y = len(base_array)
threshold = 90  # similarity cutoff (placeholder value): minimum fuzz.ratio to count as similar

def fuzzy_ratio(string_A, string_B):
    return fuzz.ratio(string_A, string_B)

vfunc = np.vectorize(fuzzy_ratio)

for b in tqdm(base_array):
    B = np.tile(b, y)                  # repeat the current string to match base_array's length
    Func_array = vfunc(base_array, B)  # pairwise ratios of b against every string
    similar_entities = np.argwhere(Func_array >= threshold)
    # extract similar entities for each entity in the base array
The key is to make this work on large arrays as fast as possible; any suggestion would be welcome.
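For reference, one direction I have not benchmarked yet is rapidfuzz's process.cdist, which computes the pairwise score matrix in C++ across multiple cores instead of calling fuzz.ratio from Python. Below is a minimal sketch, assuming rapidfuzz is installed; CUTOFF and CHUNK are placeholder values I picked for illustration, and base_array is the same array as above.

import numpy as np
from rapidfuzz import fuzz, process

CUTOFF = 90    # placeholder: minimum ratio to treat two strings as similar
CHUNK = 2000   # placeholder: rows per block so the score matrix fits in memory

matches = {}
for start in range(0, len(base_array), CHUNK):
    block = base_array[start:start + CHUNK]
    # pairwise ratios of the block against the whole array, computed in C++ on all cores;
    # scores below CUTOFF are zeroed out via score_cutoff
    scores = process.cdist(block, base_array, scorer=fuzz.ratio,
                           score_cutoff=CUTOFF, dtype=np.uint8, workers=-1)
    rows, cols = np.nonzero(scores)
    for r, c in zip(rows, cols):  # note: each string also matches itself
        matches.setdefault(base_array[start + r], []).append(base_array[c])

Even so, this is still an all-pairs comparison (roughly 4e10 ratio computations for 200,000 strings), so I suspect the real win would come from avoiding most comparisons altogether, e.g. by some form of blocking or clustering, if anyone has experience with that.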