
I have to find a solution for generating a fast similarity score (a weighted average of the Jaccard and Sørensen-Dice similarities) between a person's name and approximately 1.5M names split across 7 CSV lists.
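For concreteness, here is a minimal sketch of the metric itself; the character-bigram tokenization and the weight w are placeholders I chose just to make it concrete, not necessarily what the final solution will use:

    import java.util.HashSet;
    import java.util.Set;

    public final class NameSimilarity {

        // Split a name into character bigrams, e.g. "anna" -> {"an", "nn", "na"}.
        private static Set<String> bigrams(String s) {
            String normalized = s.toLowerCase().replaceAll("\\s+", " ").trim();
            Set<String> grams = new HashSet<>();
            for (int i = 0; i < normalized.length() - 1; i++) {
                grams.add(normalized.substring(i, i + 2));
            }
            return grams;
        }

        // Weighted average of the Jaccard and Sørensen-Dice similarities on bigram sets.
        public static double score(String a, String b, double w) {
            Set<String> ga = bigrams(a);
            Set<String> gb = bigrams(b);
            if (ga.isEmpty() || gb.isEmpty()) {
                return 0.0;
            }
            Set<String> intersection = new HashSet<>(ga);
            intersection.retainAll(gb);
            int inter = intersection.size();
            int union = ga.size() + gb.size() - inter;
            double jaccard = (double) inter / union;
            double dice = 2.0 * inter / (ga.size() + gb.size());
            return w * jaccard + (1 - w) * dice;
        }
    }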

Searching online, I found that Elasticsearch might be the tool I'm looking for, but I would appreciate feedback from anyone who has worked on similar problems, and whether they used the ELK Stack or another tool.
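To make the question concrete, this is roughly what I picture a single lookup looking like, assuming one index per list and using the (now deprecated) Java High Level REST Client. The index and field names are placeholders, and I know the returned score would be BM25-based rather than the Jaccard/Dice score I need, so the top hits would still have to be rescored on my side:

    import org.apache.http.HttpHost;
    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.builder.SearchSourceBuilder;

    public class NameLookup {
        public static void main(String[] args) throws Exception {
            try (RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

                // One request can target all 7 list indices at once
                // (index names here are placeholders).
                SearchRequest request = new SearchRequest(
                        "list1", "list2", "list3", "list4", "list5", "list6", "list7");
                request.source(new SearchSourceBuilder()
                        .query(QueryBuilders.matchQuery("name", "john doe"))
                        .size(3));

                SearchResponse response = client.search(request, RequestOptions.DEFAULT);
                response.getHits().forEach(hit ->
                        System.out.println(hit.getIndex() + " -> " + hit.getSourceAsString()
                                + " (BM25 score " + hit.getScore() + ")"));
            }
        }
    }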

Any operational hints would be appreciated too. The solution I have to develop must return, for each of the 7 lists, the similarity score of the most similar name (in terms of the average of the Jaccard and Dice similarities) to an input name whenever a perfect match isn't found, and it has to do so in about 0.1 s.

The current solution is a Java API that parallelizes the scoring operations after filtering each list down to the names sharing the input's first two letters, but it slows down as the workload increases and eventually crashes. It has to handle a peak of 50 searches/second.
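In outline, it does something like the following per request (a rough sketch, not the exact code; the real API also normalizes the names, and this sketch reuses the score method shown above):

    import java.util.List;
    import java.util.Map;

    public class ListMatcher {

        // One map per sanction list: two-letter prefix -> names starting with it.
        private final List<Map<String, List<String>>> listsByPrefix;

        public ListMatcher(List<Map<String, List<String>>> listsByPrefix) {
            this.listsByPrefix = listsByPrefix;
        }

        // Best score per list for one input name: only the bucket sharing the
        // input's first two letters is scored, and the 7 lists run in parallel.
        public double[] bestScorePerList(String input, double weight) {
            String prefix = input.toLowerCase().substring(0, Math.min(2, input.length()));
            return listsByPrefix.parallelStream()
                    .mapToDouble(list -> list.getOrDefault(prefix, List.<String>of()).stream()
                            .mapToDouble(candidate -> NameSimilarity.score(input, candidate, weight))
                            .max()
                            .orElse(0.0))
                    .toArray();
        }
    }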

MLonzo
  • Just curious, but why are the 1.5M names divided into 7 CSV lists? Loading those lists into ES and matching the input name against them sounds like something ES could easily do. – Val Jun 13 '23 at 14:55
  • We can't merge them, since the CSV files are 7 distinct sanction lists with different updating pipelines. Making one list out of the 7 would just make the computations longer and more frequent. – MLonzo Jun 13 '23 at 17:59
  • Anyway, they could also be 7 different indices that you can easily match against in a single query. – Val Jun 13 '23 at 20:36

0 Answers