
I have a large number of files from different sources and I want to sort out duplicates using the metadata. To find duplicates with different spellings (for example, differences in letters like ä or é, missing commas, etc.), I want to calculate the similarity of the strings and report those above a suitable threshold. Can anyone recommend a good algorithm for that comparison?

Ginso
  • I would start with Levenshtein distance and work your way from there, for just starting out – Rogue Sep 16 '19 at 10:49
  • `java.text.Normalizer` to decompose accents and keep just the unaccented letter (é to e). `soundex` for similar-sounding names, but it is language-specific and (imho) not very satisfactory. – Joop Eggen Sep 16 '19 at 12:09
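
The two comments above could be combined roughly as follows: strip accents and punctuation with `java.text.Normalizer`, then score the normalized strings with Levenshtein distance scaled to the range 0 to 1. This is only a minimal sketch. The class name `MetadataSimilarity`, its method names, and the choice to divide by the longer string's length are assumptions made for illustration, not something stated in the question or comments, and a suitable threshold would have to be found by trying it on real metadata.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper class; all names here are illustrative assumptions.
final class MetadataSimilarity {

    // Decompose accented letters (é -> e + combining accent), drop the
    // combining marks, lowercase, turn punctuation into spaces, and
    // collapse runs of whitespace. Letters without a decomposition
    // (for example ß or ø) are left unchanged.
    static String normalize(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed
                .replaceAll("\\p{M}+", "")
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}\\s]", " ")
                .trim()
                .replaceAll("\\s+", " ");
    }

    // Classic dynamic-programming Levenshtein distance using two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1.0 means identical after normalization.
    // Scaling by the longer length is one common convention (assumed here).
    static double similarity(String a, String b) {
        String x = normalize(a);
        String y = normalize(b);
        int max = Math.max(x.length(), y.length());
        if (max == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(x, y) / max;
    }
}
```

Pairs whose `similarity` exceeds the chosen threshold would then be reported as candidate duplicates. Comparing every pair is quadratic in the number of files, so for a large collection it may be worth grouping records by some cheap key first.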

0 Answers