I've built a complex Elasticsearch query that uses popularity to improve the ranking of social media documents. The query works really well: the top results are always relevant to the search terms and contain interesting items.

However, it has one problem: for some queries, the first results all come from the same user.

I would like to downscore a document if a higher-ranked document from the same user has already been retrieved. This way I expect to get more diversity in the results.

Note that I don't want those documents removed, as in some cases it may still be interesting to find more documents from the same user, but I would like them to appear in a lower position.

Can anybody suggest a way to make it work?


As suggested in the comments, here is a simplified version of my query:

query = {"function_score": {
  "functions": [
    {"gauss": {"createdAt":
        {"origin": "now", "scale": "30d", "offset": "7d", "decay": 0.9}
    }},
    {"gauss": {"shares.last.twitter_retweets_log":
        {"origin": 4.52, "scale": 2.61, "decay": 0.9}
    }}
  ],
  "query": {"bool": {"must": [
    {"exists": {"field": "images"}},
    {"multi_match": {"query": "foo boo", "fields": ["text", "link.title"]}}
  ]}},
  "score_mode": "multiply"
}};

P.S.: some documents that may be interesting, as they talk about diversity, but I'm not sure how to apply them:

David Mabodo
  • Can you show your actual query and some results you're currently getting? Also what is the type of the field describing your user (i.e. string or numeric)? – Val Dec 15 '15 at 04:10
  • @Val I'm using a Function Score Query on Elasticsearch 2.1. The user.id is a string. – David Mabodo Dec 15 '15 at 16:11
  • Do you mind sharing your actual query? – Val Dec 15 '15 at 16:12
  • @Val following your suggestion I added a simplified version of it. – David Mabodo Dec 15 '15 at 16:22
  • Thanks. I was going to suggest using `function_score` with `decay` for users also but only in the case where your user id was numeric, which it's not. – Val Dec 15 '15 at 16:23
  • @Val If I'm not wrong, even with that approach all the posts from the same user (so with the same user id) would get downscored by the same factor, so the issue will remain. – David Mabodo Dec 15 '15 at 16:26

1 Answer

You can couple the sampler with the top_hits aggregation to get diversified results.

{
    "query": {
        "match": {
            "query": "iphone"
        }
    },
    "size":0,
    "aggs": {
        "sample": {
            "sampler": {
                "shard_size": 200,
                "field" : "user.id"                
            },
            "aggs": {
                "diversifiedMatches": {
                    "top_hits": {
                        "size":10
                    }
                }
            }
        }
    }
}

There are some caveats, e.g.:

1) Deduplication is per-shard not global

2) Choice of diversification field must be a single-value field

3) No support for pagination

4) No support for sorting on anything other than score

Addressing the above issues would be hard and would require expensive/complex co-ordination internally plus more guidance from the client about when and where "duplicate" results can be re-introduced (page 2? page 3? how many?) etc.
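
As the comments on this answer point out, the `field` option was later split off from `sampler` into the separate `diversified_sampler` aggregation. A minimal sketch of the same request in that newer syntax, built as a Python dict, might look like the following. The function name `build_diversified_search` and the choice of `text` as the matched field are assumptions made for illustration; the sample size and the `top_hits` size are set to the same value, as suggested in the comments.

```python
import json


def build_diversified_search(text, user_field="user.id", size=10, max_docs_per_value=1):
    """Build a search body that keeps at most `max_docs_per_value` hits
    per distinct `user_field` value on each shard (hypothetical helper)."""
    return {
        # Assumption: the searchable field is called "text".
        "query": {"match": {"text": text}},
        # No regular hits; results come from the top_hits aggregation.
        "size": 0,
        "aggs": {
            "sample": {
                "diversified_sampler": {
                    "shard_size": size,
                    "field": user_field,
                    "max_docs_per_value": max_docs_per_value,
                },
                "aggs": {
                    "diversifiedMatches": {"top_hits": {"size": size}}
                },
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent to the _search endpoint.
    print(json.dumps(build_diversified_search("iphone"), indent=2))
```

The caveats listed above still apply: de-duplication happens per shard, so a user can appear more than once in the merged result.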

MarkH
  • I'm afraid I need some more info to understand this. Could you elaborate on what the sampler and the top_hits are doing? And what are the implications of the chosen values (200 & 10)? – David Mabodo Dec 15 '15 at 16:21
  • Yes, having 200 and 10 in this example may be a little weird. Sampler is filtering to only the 200 top-scoring hits on each shard (with the added restriction that we only consider the best-scoring doc for each unique user.id). Of this 200 doc sample we return the top 10 documents using top_hits. In your use case these numbers should probably be changed to be the same value. Other use cases may require big samples and then smaller results e.g. 200 sample and top 10 significant_terms agg. – MarkH Dec 15 '15 at 16:38
  • If I understand correctly, this solution implies that each user can only appear once, right? So if there are just 10 significant results, all from the same user, this solution will never show the other 9 results. Am I missing something? – David Mabodo Dec 15 '15 at 17:03
  • On each shard you will get at most 1 doc per user (assuming, as in my example, running with the default `max_docs_per_value` of 1). If you have 5 shards though you may get max 5 results from the same user in the final top 10 as the de-duplication only occurs at shard level – MarkH Dec 15 '15 at 17:31
  • Ah sorry, your question was if there are ONLY results all from the same user. In that case I think it will "fill in" the results with the 9 docs you weren't supposed to have – MarkH Dec 15 '15 at 17:34
  • Just checked - assuming a single shard it won't "fill in" the other 9 results and honours the "max_docs_per_value" setting of only 1 so you only get one result – MarkH Dec 15 '15 at 17:42
  • I was afraid of that, so it doesn't fully solve my problem. However, it is still the best answer so far. If nobody else replies in the following days I will give you the bounty. – David Mabodo Dec 17 '15 at 10:56
  • @DavidMabodo Did you find a working solution to this problem? By working, I mean one that includes proper pagination and boosting too. – Utkarsh Mishra Jan 11 '17 at 15:55
  • @UtkarshMishra nope, I did the rework internally after getting the results. – David Mabodo Jan 12 '17 at 17:08
  • I am not seeing `"field"` documented as a property of [`sampler`](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-sampler-aggregation.html) aggregations. Was that in an older version of Elasticsearch, or did you mean to use [`diversified_sampler`](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-diversified-sampler-aggregation.html) which does have a `"field"` property? – Carl G Mar 04 '19 at 16:28
  • Yes, this looks like old syntax. At some point we split the diversification support in the `sampler` agg off into the `diversified_sampler` agg. – MarkH Mar 05 '19 at 17:47
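
The thread ends with the asker reworking the results on the client after retrieving them. The actual implementation was not shared, so the following is only a sketch of one way such a rework could be done: each hit's score is multiplied by a penalty for every higher-ranked hit from the same user, so repeated users sink in the ranking but are never removed. The function name `rerank_by_user`, the `penalty` value, and the assumption that the user id lives at `_source.user.id` are all hypothetical.

```python
from collections import defaultdict


def rerank_by_user(hits, penalty=0.5, user_of=lambda h: h["_source"]["user"]["id"]):
    """Demote (never drop) hits whose user already appeared higher up.

    `hits` is a list of Elasticsearch-style hit dicts with a `_score`.
    The k-th hit of a user (k = 0, 1, 2, ...) keeps score * penalty**k.
    """
    # Walk the hits from best to worst original score.
    ordered = sorted(hits, key=lambda h: h["_score"], reverse=True)
    seen = defaultdict(int)  # user id -> number of hits already seen
    rescored = []
    for hit in ordered:
        user_id = user_of(hit)
        adjusted = hit["_score"] * (penalty ** seen[user_id])
        seen[user_id] += 1
        rescored.append((adjusted, hit))
    # Stable sort keeps the original order among equal adjusted scores.
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [hit for _, hit in rescored]


if __name__ == "__main__":
    sample = [
        {"_id": "a", "_score": 10.0, "_source": {"user": {"id": "u1"}}},
        {"_id": "b", "_score": 9.0, "_source": {"user": {"id": "u1"}}},
        {"_id": "c", "_score": 8.0, "_source": {"user": {"id": "u2"}}},
        {"_id": "d", "_score": 7.0, "_source": {"user": {"id": "u1"}}},
    ]
    print([h["_id"] for h in rerank_by_user(sample)])
```

Because this runs on an already fetched page, it only diversifies within that page; fetching more hits than are displayed gives the re-ranking more room to work.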