How to retrieve only one of a group of similar strings in a list of strings in java.
I have an list of text data (length of list is ~ 60000)(stored as strings) within which there are groups of text which are very similar to each other. From this list I would like to create a new list that only has 1 element for each group of similar list elements
Simplified example:
the boy ate an apple
boy ate apple
the boy ate apple
Should only have 1 of the above in the new list
My general approach is to have 2 lists: the original list and a new list that will hold the unique list
For each text in original_list
for each utext in the unique list
if similarity(text, utext) > threshold (threshold can be 90%)
break
else
is_similar = false
end for
if is_similar = false
add text to unique list
end for
For the similarity function I have used simmetrics Levenshtein distance java library. However I eventually run into java heap space issues even when I increase the jre memory to 6GB
I have also removed stopwords and converted to term vectors using sparse matrices. However, this is very slow.
I do think I can use the override equals()
and hashcode()
option as since I am fuzzy matching I cannot guarantee that the hashcode()
will be equal for strings that are only similar.
Can anyone suggest a more efficient approach to my algorithm please? I am a little rusty with data structures and have been racking my brain and searching online for a solution.
I hope my question is clear. Thanks