When I repeatedly issue the same match-all query (`*:*`), I get different clusters and scores every time. What could be the reason?

First try:

label: "В Минске" ("In Minsk")
score: 52.79549568196028

Second try:

label: "В Минске" ("In Minsk")
score: 54.74385944060893

Third try:

label: "В Минске" ("In Minsk")
score: 48.884082925408734

The document ids inside the clusters also differ, and the clusters themselves change: one response contains a cluster "тысячами евро" ("thousands of euros"), while in the next response it is gone and a new cluster appears: "Тысячами Долларов" ("Thousands of Dollars").

Is there a Carrot2 parameter that would make the clusters stable for a given query? Could it be `desiredClusterCountBase`?

The Solr index is the same for all cases. Algorithm used: org.carrot2.clustering.lingo.LingoClusteringAlgorithm with StopWordLabelFilter.enabled=false and clustering.rows=1000.

D_K
  • I don't know about Carrot or the `LingoClusteringAlgorithm`, but in a standard Solr installation the score of a document is heavily influenced by the other documents in the core or in the result set. I would not be surprised that you get different scores for the same text if each query was run against a different core. – Hector Correa Nov 01 '18 at 18:37
  • @HectorCorrea good point, but I have the same core => the same static universe of documents. I will next try desiredClusterCountBase option to see if it gives the sought stability – D_K Nov 01 '18 at 20:23
  • The index itself hasn't changed in any way? I.e. no commits/optimizes? – MatsLindh Nov 01 '18 at 20:59
  • @MatsLindh the index was stale. See the answer below with my hypothesis about what happened. – D_K Nov 01 '18 at 21:32

1 Answer

It looks like I found the reason:

  • in the index there was a duplicate of each document, with only one difference: one copy had a publication date, the other did not.
  • at the same time, my date filter did not work correctly because publication dates were stamped incorrectly on the documents, so the ranking function with reciprocal rank could return a different set of documents for the top 1000 on each request (this part is hard to debug without looking into the Solr source code).
  • the clustering module would therefore receive slightly different sets of documents => the clusters would change. However, the most prominent clusters (by size) were still stable; only their scores changed. Less prominent clusters could be replaced by other less prominent clusters between requests.
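One way to check this hypothesis is to measure how much the set of clustered document ids overlaps between two identical requests. The following is only a minimal sketch: it assumes each response has already been parsed into a list of cluster dicts with a `docs` list of ids (the shape the Solr clustering component normally returns), and the helper names `jaccard` and `cluster_doc_ids` as well as the sample data are made up for illustration.

```python
def jaccard(ids_a, ids_b):
    """Return the Jaccard overlap (0.0 to 1.0) of two collections of document ids."""
    a, b = set(ids_a), set(ids_b)
    if not a and not b:
        # Two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


def cluster_doc_ids(clusters):
    """Collect every document id mentioned in a list of cluster dicts."""
    ids = set()
    for cluster in clusters:
        ids.update(cluster.get("docs", []))
    return ids


# Hypothetical parsed "clusters" sections from two identical requests.
run_1 = [{"labels": ["A"], "docs": ["1", "2", "3"]}]
run_2 = [{"labels": ["A"], "docs": ["2", "3", "4"]}]

overlap = jaccard(cluster_doc_ids(run_1), cluster_doc_ids(run_2))
print(f"overlap between runs: {overlap:.2f}")
```

An overlap clearly below 1.0 for the same query against an unchanged index would suggest that the input to the clustering step is unstable, rather than the clustering algorithm itself.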

I still don't know whether this is a bug, but removing all documents from the index and re-adding them with the correct publication date solved the issue.
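Before reindexing, duplicates of this kind can be spotted by grouping exported documents on every field except the id and the date. This is a rough sketch, not what I actually ran: the field names `id` and `pub_date`, the function `find_date_only_duplicates`, and the sample documents are all assumptions, and it presumes the two copies were stored under different ids.

```python
from collections import defaultdict


def find_date_only_duplicates(docs, date_field="pub_date", id_field="id"):
    """Group docs that are identical apart from their id and date fields.

    Returns a list of id lists, one per group with more than one member.
    """
    groups = defaultdict(list)
    for doc in docs:
        # Build a hashable key from all fields except id and date.
        key = tuple(
            sorted(
                (name, repr(value))
                for name, value in doc.items()
                if name not in (date_field, id_field)
            )
        )
        groups[key].append(doc[id_field])
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical documents exported from the index.
docs = [
    {"id": "1", "title": "a", "pub_date": "2018-10-01"},
    {"id": "2", "title": "a"},  # same content, date missing
    {"id": "3", "title": "b"},
]

duplicates = find_date_only_duplicates(docs)
print(duplicates)
```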

D_K