I store all articles from a number of news sources. An article that originates from, e.g., Cnn.com might be reposted by others, so I end up saving the same article many times.

If I search for 'Tesla' I might get 3 articles that are 90% equal to each other. I can compare and filter out duplicates in my app using the Levenshtein distance, but I would rather have ES do the filtering.
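For reference, the app-side filtering mentioned above could look roughly like the following. This is only a minimal sketch: the function names and the 0.9 threshold are illustrative assumptions, and similarity is taken here as 1 minus the edit distance divided by the longer text's length, which is one common definition and not the only one.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed definition: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def drop_near_duplicates(articles: List[str], threshold: float = 0.9) -> List[str]:
    """Keep an article only if it is not more than `threshold` similar to one already kept."""
    kept: List[str] = []
    for text in articles:
        if all(similarity(text, other) <= threshold for other in kept):
            kept.append(text)
    return kept
```

Every candidate is compared against every article already kept, so the cost grows quickly with the number of hits and the length of the texts, which is part of why I would like ES to handle it.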

Is there a way to say "give me all articles matching WORD, but if other hits are more than 90% equal to the first one, only return the first"?

Cheers, Martin

martins

1 Answer

If you really need to keep all these records in ES (instead of filtering them out with Levenshtein before indexing), then you're probably looking for a top hits aggregation used for field collapsing.
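A rough sketch of what such a request body might look like, built as a plain Python dict so it can be passed to whatever client you use. The field names `body` and `fingerprint` are assumptions, not something from your mapping. Note that a `terms` aggregation groups on exact field values, so this only collapses near-duplicates if you store some value at index time that is identical for articles you consider the same (for example a normalized title or a content hash); it does not compute a 90% similarity by itself.

```python
import json


def build_dedup_query(word, text_field="body", group_field="fingerprint", max_groups=10):
    """Build a search body that returns one top hit per distinct group value.

    text_field and group_field are hypothetical names; group_field must be
    a keyword-like field holding the same value for duplicate articles.
    """
    return {
        "size": 0,  # skip the flat hit list, only the buckets are of interest
        "query": {"match": {text_field: word}},
        "aggs": {
            "by_group": {
                # one bucket per distinct group value
                "terms": {"field": group_field, "size": max_groups},
                "aggs": {
                    # keep only the best-scoring document in each bucket
                    "first_hit": {"top_hits": {"size": 1}}
                },
            }
        },
    }


print(json.dumps(build_dedup_query("Tesla"), indent=2))
```

By default the buckets come back ordered by document count and `top_hits` sorts by score inside each bucket, so each bucket's single hit is the best match among its duplicates.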

Also take a look at this SO question

Slam