1

How to retrieve only one of a group of similar strings in a list of strings in java.

I have an list of text data (length of list is ~ 60000)(stored as strings) within which there are groups of text which are very similar to each other. From this list I would like to create a new list that only has 1 element for each group of similar list elements

Simplified example:

the boy ate an apple
boy ate apple
the boy ate apple

Should only have 1 of the above in the new list

My general approach is to have 2 lists: the original list and a new list that will hold the unique list

For each text in original_list
    for each utext in the unique list
        if similarity(text, utext) > threshold (threshold can be 90%)
            break
        else
            is_similar = false
    end for

    if is_similar = false   
        add text to unique list
end for

For the similarity function I have used simmetrics Levenshtein distance java library. However I eventually run into java heap space issues even when I increase the jre memory to 6GB

I have also removed stopwords and converted to term vectors using sparse matrices. However, this is very slow.

I do think I can use the override equals() and hashcode() option as since I am fuzzy matching I cannot guarantee that the hashcode() will be equal for strings that are only similar.

Can anyone suggest a more efficient approach to my algorithm please? I am a little rusty with data structures and have been racking my brain and searching online for a solution.

I hope my question is clear. Thanks

morrk
  • 31
  • 4

1 Answers1

2

I used Lucene, as suggested, to index each string and this made the overall performance of checking similarity much better!

I did come across another suggested alternative here that looks like it might have worked but did not try it as I got what I needed from Lucene.

Thanks!

Community
  • 1
  • 1
morrk
  • 31
  • 4