Java best algorithm to calculate similarity between strings

Asked Sep 16 '19 at 10:47

Active Sep 16 '19 at 11:46

Viewed 1,038 times

I have a large number of files from different sources and I want to sort out duplicates using the metadata. To find duplicates that are spelled differently (for example, letters like ä or é, missing commas, etc.), I want to calculate the similarity of the strings and report those above a suitable threshold. Can anyone recommend a good algorithm for that comparison?


Ginso


I would start with Levenshtein distance and work your way up from there, at least for starting out. – Rogue Sep 16 '19 at 10:49
Use `java.text.Normalizer` to decompose accented characters and keep only the unaccented letter (é to e). `soundex` works for similar-sounding names, but it is language-specific and (imho) not very satisfactory. – Joop Eggen Sep 16 '19 at 12:09
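
The two suggestions in the comments can be combined. Below is a minimal sketch, not a definitive implementation: it strips accents with `java.text.Normalizer`, drops ASCII punctuation, and then turns the Levenshtein distance into a score between 0 and 1. The class name `StringSimilarity`, its method names, and the `0.85` threshold are made up for illustration and would need tuning on real metadata. Letters that have no canonical decomposition (such as ß or ø) are left unchanged by this approach.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper class; the names are illustrative, not from any library.
final class StringSimilarity {

    // Strip accents, lowercase, replace ASCII punctuation with spaces,
    // and collapse runs of whitespace.
    static String normalize(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        String noMarks = decomposed.replaceAll("\\p{M}+", ""); // drop combining marks
        String lower = noMarks.toLowerCase(Locale.ROOT);
        String noPunct = lower.replaceAll("\\p{Punct}+", " ");
        return noPunct.trim().replaceAll("\\s+", " ");
    }

    // Classic dynamic-programming Levenshtein distance using two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1.0 means identical after normalization.
    static double similarity(String a, String b) {
        String x = normalize(a);
        String y = normalize(b);
        int maxLen = Math.max(x.length(), y.length());
        if (maxLen == 0) {
            return 1.0; // both strings are empty after normalization
        }
        return 1.0 - (double) levenshtein(x, y) / maxLen;
    }

    public static void main(String[] args) {
        double threshold = 0.85; // assumed value, tune on your own data
        String a = "M\u00fcller, Hans"; // "Müller, Hans"
        String b = "Muller Hans";
        double score = similarity(a, b);
        System.out.println(score + (score >= threshold ? " -> likely duplicate" : " -> distinct"));
    }
}
```

In this sketch both example strings normalize to `muller hans`, so their score is 1.0. Comparing every pair of records this way is quadratic in the number of files, so for a large collection it may be worth grouping records first (for example by a normalized prefix) and only comparing within each group.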


0 Answers