I have a large number of files from different sources and want to sort out duplicates using their metadata. To find duplicates with different spellings (for example, differences in letters like ä or é, missing commas, etc.), I want to calculate the similarity of the strings and report pairs above a suitable threshold. Can anyone recommend a good algorithm for that comparison?
-
I would start with Levenshtein distance and work your way from there, for just starting out – Rogue Sep 16 '19 at 10:49
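To make the Levenshtein suggestion concrete, here is a minimal sketch of the edit distance plus a length-normalized similarity score. The class name `Levenshtein`, the method `similarity`, and the idea of dividing by the longer string's length are illustrative assumptions, not something the commenter specified; a suitable threshold has to be found by trying it on your own metadata.

```java
/** Hypothetical helper: plain Levenshtein distance with two rolling rows. */
public final class Levenshtein {
    private Levenshtein() {}

    /** Number of single-character insertions, deletions, or substitutions to turn a into b. */
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Turning the empty prefix of a into the first j chars of b costs j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i chars of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed normalization: 1.0 for identical strings, 0.0 for completely different ones. */
    public static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // two empty strings are identical
        }
        return 1.0 - (double) distance(a, b) / maxLen;
    }
}
```

A pair would then be reported as a candidate duplicate when `Levenshtein.similarity(x, y)` exceeds the chosen threshold. Comparing every pair is quadratic in the number of files, so for large collections it may help to compare only within groups that already share some cheap key.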
-
Use `java.text.Normalizer` to decompose accented characters and keep only the unaccented letter (é to e). `soundex` handles similar-sounding names, but it is language-specific and (imho) not very satisfactory. – Joop Eggen Sep 16 '19 at 12:09
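The `java.text.Normalizer` idea could look roughly like the sketch below, which decomposes each string and strips the combining marks before any comparison. The class name `MetadataKey`, and the extra steps of replacing punctuation with spaces and lower-casing, are assumptions added here to cover the "missing commas" case from the question; they are not part of the comment.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical helper that reduces a metadata string to a comparison key. */
public final class MetadataKey {
    private MetadataKey() {}

    public static String normalize(String s) {
        // NFD splits an accented letter into base letter + combining mark.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed
                .replaceAll("\\p{M}+", "")       // drop the combining marks
                .replaceAll("\\p{Punct}+", " ")  // ASCII punctuation becomes a space
                .trim()
                .replaceAll("\\s+", " ")         // collapse runs of whitespace
                .toLowerCase(Locale.ROOT);
    }
}
```

With this, `MetadataKey.normalize("M\u00fcller, Jos\u00e9")` and `MetadataKey.normalize("Muller Jose")` both yield `muller jose`, so exact matching on the key already catches those variants. Letters without a canonical decomposition, such as ß or ø, pass through unchanged and would still need a fuzzy comparison.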