I would like to be able to query for text but also retrieve only the results with the maximum value of a certain integer field in my data. I have read the docs about aggregations and filters and I don't quite see what I am looking for.

For instance, I have some repeating data that gets indexed that is the same except for an integer field - let's call this field lastseen.

So, as an example, given this data put into elasticsearch:

  # these two are the same except for the "lastseen" field
  curl -XPOST localhost:9200/myindex/myobject -d '{
    "field1": "dinner carrot potato broccoli",
    "field2": "something here",
    "lastseen": 1000
  }'

  curl -XPOST localhost:9200/myindex/myobject -d '{
    "field1": "dinner carrot potato broccoli",
    "field2": "something here",
    "lastseen": 100
  }'

  # and these two the same except "lastseen" field
  curl -XPOST localhost:9200/myindex/myobject -d '{
    "field1": "fish chicken something",
    "field2": "dinner",
    "lastseen": 2000
  }'

  curl -XPOST localhost:9200/myindex/myobject -d '{
    "field1": "fish chicken something",
    "field2": "dinner",
    "lastseen": 200
  }'

If I query for "dinner"

  curl -XPOST localhost:9200/myindex/_search -d '{
    "query": {
      "query_string": {
        "query": "dinner"
      }
    }
  }'

I'll get 4 results back. I'd like to have a filter such that I only get two results back: for each set of otherwise identical documents, only the one with the maximum lastseen value.

This is obviously not right, but hopefully it gives you an idea of what I am after:

{
    "query": {
        "query_string": {
            "query": "dinner"
        }
    },
    "filter": {
          "max": "lastseen"
        }

}

The results would look something like:

"hits": [
      {
        ...
        "_source": {
          "field1": "dinner carrot potato broccoli",
          "field2": "something here",
          "lastseen": 1000
        }
      },
      {
        ...
        "_source": {
          "field1": "fish chicken something",
          "field2": "dinner",
          "lastseen": 2000
        }
      } 
   ]
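
To pin down the desired behaviour, here is a minimal Python sketch over the four sample documents. It is plain Python, not an Elasticsearch query. It assumes that documents with the same field1 and field2 count as duplicates, that the second document's value of 100 is its lastseen, and that a whole-word check is a good enough stand-in for the full-text match. `latest_per_duplicate` is a hypothetical helper name.

```python
# Sample documents from the question; the second one's value of 100 is
# assumed here to be its "lastseen".
docs = [
    {"field1": "dinner carrot potato broccoli", "field2": "something here", "lastseen": 1000},
    {"field1": "dinner carrot potato broccoli", "field2": "something here", "lastseen": 100},
    {"field1": "fish chicken something", "field2": "dinner", "lastseen": 2000},
    {"field1": "fish chicken something", "field2": "dinner", "lastseen": 200},
]


def latest_per_duplicate(docs, keyword):
    """For each (field1, field2) pair that mentions keyword, keep the doc with the highest lastseen."""
    best = {}
    for doc in docs:
        words = (doc["field1"] + " " + doc["field2"]).split()
        if keyword not in words:
            continue  # crude stand-in for the full-text match
        key = (doc["field1"], doc["field2"])  # assumed definition of "duplicate"
        if key not in best or doc["lastseen"] > best[key]["lastseen"]:
            best[key] = doc
    return list(best.values())


results = latest_per_duplicate(docs, "dinner")
print([d["lastseen"] for d in results])
```

All four documents match "dinner", but only one document per duplicate group survives.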

update 1: I tried creating a mapping that excluded lastseen from being indexed. This did not work. Still getting all 4 results back.

curl -XPOST localhost:9200/myindex -d '{  
    "mappings": {
      "myobject": {
        "properties": {
          "lastseen": {
            "type": "long",
            "store": "yes",
            "include_in_all": false
          }
        }
      }
    }
}'

update 2: I tried a deduplication with the agg scheme listed here, and it did not work, but more importantly, I don't see a way to combine that with a keyword search.

adapt-dev
  • What if you had two docs with `lastseen: 2000`, you want both returned or one with `lastseen: 2000` and one with `lastseen: 1000`? – Andrei Stefan Jul 22 '15 at 06:06
  • Also, what do you consider as a duplicate document? I see that you recognize this type of docs as the ones having the same `field1`. – Andrei Stefan Jul 22 '15 at 06:09
  • @AndreiStefan a duplicate document would have the same field1 and field2. – adapt-dev Jul 22 '15 at 12:19
  • Then you can use the approach I described in your other post: http://stackoverflow.com/questions/31553928/elasticsearch-copy-to-field-not-behaving-as-expected-with-aggregations. Use `_source` transformation to concatenate both fields to a `not_analyzed` third field and use that in the aggregation I specified in my answer: `"terms": { "field": "all_fields", "size": 2 }`. – Andrei Stefan Jul 22 '15 at 13:01

1 Answer

Not ideal, but I think it gets you what you need.

Change the mapping of your field1 field, assuming this is the one that you use to define "duplicate" documents, like this:

PUT /lastseen
{
  "mappings": {
    "test": {
      "properties": {
        "field1": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "field2": {
          "type": "string"
        },
        "lastseen": {
          "type": "long"
        }
      }
    }
  }
}

That is, you add a .raw subfield that is not_analyzed, so the value is indexed exactly as it is, as a single term, instead of being analyzed and split into separate terms. This is what makes it possible to spot "duplicate" documents: two documents are duplicates when their field1.raw terms are identical.
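
To see why the not_analyzed subfield matters for bucketing, here is a rough Python illustration (not Elasticsearch code). It assumes an analyzed field is lowercased and split on whitespace, which only approximates what the default analyzer does, and the variable names are made up for this sketch.

```python
from collections import Counter

# The field1 values of the four sample documents.
values = [
    "dinner carrot potato broccoli",
    "dinner carrot potato broccoli",
    "fish chicken something",
    "fish chicken something",
]

# Analyzed: every value is broken into terms, so documents are bucketed per word.
# set() makes each document count once per distinct term.
analyzed_buckets = Counter(term for v in values for term in set(v.lower().split()))

# not_analyzed: the whole value is a single term, so documents are bucketed per value.
raw_buckets = Counter(values)

print(len(analyzed_buckets), "buckets when analyzed")
print(len(raw_buckets), "buckets when not analyzed")
```

Bucketing on the analyzed field would give one bucket per word, which says nothing about duplicates. Bucketing on the raw value gives one bucket per distinct field1.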

Then, you need to use a terms aggregation on field1.raw (for duplicates) and a top_hits sub-aggregation to get a single document for each field1 value:

GET /lastseen/test/_search
{
  "size": 0,
  "query": {
    "query_string": {
      "query": "dinner"
    }
  },
  "aggs": {
    "field1_unique": {
      "terms": {
        "field": "field1.raw",
        "size": 2
      },
      "aggs": {
        "first_one": {
          "top_hits": {
            "size": 1,
            "sort": [{"lastseen": {"order":"desc"}}]
          }
        }
      }
    }
  }
}

The single document that top_hits returns for each bucket is the one with the highest lastseen, because "sort": [{"lastseen": {"order":"desc"}}] orders the documents in the bucket by lastseen in descending order and "size": 1 keeps only the first.

The results you will get back are these (under aggregations, not hits, since "size": 0 suppresses the regular hits):

   ...
   "aggregations": {
      "field1_unique": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "dinner carrot potato broccoli",
               "doc_count": 2,
               "first_one": {
                  "hits": {
                     "total": 2,
                     "max_score": null,
                     "hits": [
                        {
                           "_index": "lastseen",
                           "_type": "test",
                           "_id": "AU60ZObtjKWeJgeyudI-",
                           "_score": null,
                           "_source": {
                              "field1": "dinner carrot potato broccoli",
                              "field2": "something here",
                              "lastseen": 1000
                           },
                           "sort": [
                              1000
                           ]
                        }
                     ]
                  }
               }
            },
            {
               "key": "fish chicken something",
               "doc_count": 2,
               "first_one": {
                  "hits": {
                     "total": 2,
                     "max_score": null,
                     "hits": [
                        {
                           "_index": "lastseen",
                           "_type": "test",
                           "_id": "AU60ZObtjKWeJgeyudJA",
                           "_score": null,
                           "_source": {
                              "field1": "fish chicken something",
                              "field2": "dinner",
                              "lastseen": 2000
                           },
                           "sort": [
                              2000
                           ]
                        }
                     ]
                  }
               }
            }
         ]
      }
   }
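
Since the documents come back nested under aggregations, the client has to pull them out of the buckets to get a flat list like the one asked for in the question. A minimal Python sketch, assuming the response has already been parsed into a dict (for example with json.loads). `flatten_top_hits` is a hypothetical helper, not part of any Elasticsearch client, and the response below is a trimmed-down copy of the one above.

```python
def flatten_top_hits(response, terms_agg="field1_unique", hits_agg="first_one"):
    """Collect the top_hits documents from every terms bucket into one flat list."""
    buckets = response["aggregations"][terms_agg]["buckets"]
    return [hit for bucket in buckets for hit in bucket[hits_agg]["hits"]["hits"]]


# Trimmed-down copy of the aggregation response shown above.
response = {
    "aggregations": {
        "field1_unique": {
            "buckets": [
                {
                    "key": "dinner carrot potato broccoli",
                    "doc_count": 2,
                    "first_one": {"hits": {"total": 2, "hits": [
                        {"_source": {"field1": "dinner carrot potato broccoli",
                                     "field2": "something here",
                                     "lastseen": 1000}},
                    ]}},
                },
                {
                    "key": "fish chicken something",
                    "doc_count": 2,
                    "first_one": {"hits": {"total": 2, "hits": [
                        {"_source": {"field1": "fish chicken something",
                                     "field2": "dinner",
                                     "lastseen": 2000}},
                    ]}},
                },
            ]
        }
    }
}

hits = flatten_top_hits(response)
print([hit["_source"]["lastseen"] for hit in hits])
```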
Andrei Stefan