I have 200,000 strings and I need to find the similar ones among them. I expect the number of similar strings in the set to be very low. Please suggest an efficient data structure for this.
A simple hash would work if I were looking for exact matches, but "similarity" is custom-defined in my case: two strings are treated as similar if 80% of the characters in them are the same, regardless of order.
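For concreteness, here is a sketch of the pairwise check I have in mind (the exact tie-breaking for unequal lengths is my own assumption; I compare the multiset overlap against the longer string):

```python
from collections import Counter

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """True if at least `threshold` of the characters match, ignoring order.

    Counter(a) & Counter(b) gives the multiset intersection, so each
    character is matched at most as many times as it occurs in both strings.
    """
    overlap = sum((Counter(a) & Counter(b)).values())
    return overlap >= threshold * max(len(a), len(b))
```

So `similar("abcde", "edcba")` is True (all 5 characters match after reordering), while `similar("abcde", "abcxy")` is False (only 3 of 5 match).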
I don't want to call the similarity function ~(200k * 100k) times. Any suggestions, such as techniques for preprocessing the strings or efficient data structures, are welcome. Thanks.