
Problem:

If I search for "iphone" I get 400 product results, and my product category aggregation returns the top 3 categories in the result set.

Those categories would include smartphones, phone cases and mobile phone accessories.

If I search "iphone 6" I get 1400 results, because the extra "6" matches many more products. The product category aggregation now returns the top 3 categories across all of those results.

The top 3 product categories will now be everything from cables to computer monitors.

What I need to do is get the top 3 categories for the top 100 results.
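
To make the goal concrete, here is a minimal client-side sketch in Python. It assumes the hits have already been fetched sorted by descending score and that each hit carries a single `product_category` value in `_source`; the function name and the sample data are made up for illustration, and this is not an Elasticsearch aggregation.

```python
from collections import Counter

def top_categories(hits, top_n_hits=100, top_n_categories=3):
    """Count categories among the first `top_n_hits` hits only.

    `hits` is assumed to be a list of Elasticsearch-style hit dicts,
    already sorted by descending score.
    """
    best = hits[:top_n_hits]
    counts = Counter(hit["_source"]["product_category"] for hit in best)
    return counts.most_common(top_n_categories)

# Hypothetical data: three relevant hits followed by a long tail of weak matches.
hits = (
    [{"_source": {"product_category": "smartphones"}}] * 2
    + [{"_source": {"product_category": "phone cases"}}]
    + [{"_source": {"product_category": "cables"}}] * 50
)

# Restricting the count to the best hits keeps the tail from dominating.
print(top_categories(hits, top_n_hits=3))
```

Counting over all hits instead would put "cables" first, which is the behaviour described above.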


What I've tried:

I've tried using the top_hits aggregation within the top category aggregation but that only returns the top products in each category.

Something like this:

{
    "aggs": {
        "product_categories": {
            "terms": {
                "field": "product_category",
                "size": 10
            },
            "aggs": {
                "top-categories": {
                    "top_hits": {
                        "size": 3
                    }
                }
            }
        }
    }
}

I've also tried creating a top_hits aggregation with a sub-aggregation within to get the top categories but that doesn't work either.

{
    "aggs": {
        "top-categories": {
            "top_hits": {
                "size": 100
            },
            "aggs": {
                "product_categories": {
                    "terms": {
                        "field": "product_category",
                        "size": 3
                    }
                }
            }
        }
    }
}

Can anyone help me with this problem?

Ivar
  • Same problem here. Did you ever come to a solution? – khituras Apr 15 '15 at 07:23
  • No, I didn't find a solution that I was happy with. I'm still using the original solution which I described in the question. It's not ideal but until I find a better way it will have to do. I'll let you know if I find anything that works. – Ivar Apr 16 '15 at 10:48
  • Thank you very much. I found something that might help you. For details please refer to the answer I added. – khituras Apr 16 '15 at 14:54

3 Answers


You could try using a filter aggregation based on a limit filter, and nest your terms aggregation in it.

Be aware that the limit is applied at shard level (see the documentation).

However, this should do the job for your case, with a query like:

{
  "aggs": {
    "limit_results": {
      "filter": {
        "limit": {
          "value": 100
        }
      },
      "aggs": {
        "product_categories": {
          "terms": {
            "field": "product_category",
            "size": 10
          }
        }
      }
    }
  }
}
ThomasC
  • Thanks for the answer, however I get a 400 error when trying your solution. Parse Failure [Found two aggregation type definitions in [limit_results]: – Ivar Mar 19 '15 at 13:11
  • Fixed : I missed the `aggs` wrapping product_categories, my bad. – ThomasC Mar 19 '15 at 13:54
  • Thanks, that fixed the query but it seems I'm not getting the correct categories in the aggregation, anyway thanks for the help, I appreciate it. – Ivar Mar 20 '15 at 13:25
  • The issue with "limit" is that it works without respect to the score. It just stops after the first 100 documents (if you specified "value": 100). I have the very same issue and would be very interested to see a solution. – khituras Apr 15 '15 at 07:22

Before I begin, please note that this is not a perfect solution to the question. However, it could definitely ease the situation, and in one special case it actually is a perfect solution.

The solution I propose works by sorting the terms aggregation buckets by the score of the documents they were found in. That is, the ordering of the terms is no longer only by frequency but also by document score.

Here is an example request:

{
   "query": {
       "query_string": {
           "default_field": "product_title",
           "query": "iphone 6"
       }
   },
   "aggs": {
       "product_categories": {
           "terms": {
               "field": "product_category",
               "order": [
                   { "max_score": "desc" },
                   { "_count": "desc" }
               ],
               "size": 3
           },
           "aggs": {
               "max_score": {
                   "max": {
                       "script": "_score"
                   }
               }
           }
       }
   }
}

Please note the "order" property of the terms aggregation. It specifies a path to the max_score aggregation, which in turn just returns the special _score field, which exposes the score of each document hit by the query. It ALSO uses the frequency of each term, via the "_count" property in second position.

This request will give you the three terms in the product_category field that are the best of "very frequent and from highly ranked documents". I cannot say more explicitly how the ranking is done. I noticed in preliminary experiments that the result does not enumerate document scores monotonically but may "jump over" a quite highly ranked document when it only includes terms of low frequency, which actually might be what you want for your use case. The documentation for this kind of ordering is found here: http://www.elastic.co/guide/en/elasticsearch/reference/1.x/search-aggregations-bucket-terms-aggregation.html

The documentation linked above also has an example of ordering by multiple criteria, but it just says "The above will sort the countries buckets based on the average height among the female population and then by their doc_count in descending order". My impression was that it could be some kind of harmonic mean or something. It is perhaps best to check for yourself whether you find the results of this approach useful.

The special case I spoke of at the beginning is when each document has exactly one value in the requested field. In this case, if you leave out the "_count" ordering, you actually get the top N terms for the top N documents (with the same N for both).
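
As a rough illustration, here is a small Python simulation of this bucket ordering. It does not call Elasticsearch. It assumes the two criteria act as a primary sort plus a tie-breaker, which is how the quoted documentation sentence reads but which I have not verified against the server; the function name, scores and categories are invented, and each document is assumed to have exactly one category.

```python
from collections import defaultdict

def order_buckets(docs, size=3, use_count=True):
    """Rank categories by best document score, then optionally by count.

    docs: iterable of (score, category) pairs, one category per document.
    """
    max_score = defaultdict(lambda: float("-inf"))
    doc_count = defaultdict(int)
    for score, category in docs:
        max_score[category] = max(max_score[category], score)
        doc_count[category] += 1

    def sort_key(category):
        # Assumption: the count only breaks ties between equal max scores.
        tie_breaker = -doc_count[category] if use_count else 0
        return (-max_score[category], tie_breaker)

    return sorted(max_score, key=sort_key)[:size]

# Hypothetical hits: a few strong matches and a frequent but weak category.
docs = [
    (9.1, "smartphones"), (8.7, "phone cases"), (8.7, "smartphones"),
    (8.2, "accessories"), (1.3, "cables"), (1.2, "cables"),
    (1.1, "cables"), (1.0, "monitors"),
]
print(order_buckets(docs))
```

Under these assumptions "cables" drops out of the top three even though it is the most frequent category, because its best document scores poorly.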

khituras

You are looking for the Sampler Aggregation. I have a similar answer at "Aggregation on top n results".

{
  "aggs": {
    "bestDocs": {
      "sampler": {
        "shard_size": 100
      },
      "aggs": {
        "product_categories": {
          "terms": {
            "field": "product_category",
            "size": 3
          }
        }
      }
    }
  }
}

It will take the top 100 docs on each shard, sorted by their scores, and then run the terms aggregation on just those docs.
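
For reference, here is a hedged Python sketch that builds a complete request body around the sampler. The sampler only has "best" documents to pick if the request also contains a scoring query, so the sketch adds one; the `query_string` on `product_title` and the `"size": 0` setting are assumptions made for illustration, and `sampler_request` is a hypothetical helper. Nothing is sent to a cluster.

```python
import json

def sampler_request(query_text, sample_size=100, top_categories=3):
    """Build a search body: scored query plus a sampler-wrapped terms agg."""
    return {
        # Assumed query; the score it produces decides which docs are sampled.
        "query": {
            "query_string": {
                "default_field": "product_title",
                "query": query_text,
            }
        },
        # Assumed: skip returning hits, only the aggregation is of interest.
        "size": 0,
        "aggs": {
            "bestDocs": {
                # shard_size is applied on each shard separately.
                "sampler": {"shard_size": sample_size},
                "aggs": {
                    "product_categories": {
                        "terms": {
                            "field": "product_category",
                            "size": top_categories,
                        }
                    }
                },
            }
        },
    }

body = sampler_request("iphone 6")
print(json.dumps(body, indent=2))
```

Because `shard_size` is a per-shard limit, an index with several shards can contribute more than 100 documents in total to the category counts.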

Rahul