2

We've got a system that indexes resume documents in Elasticsearch using the mapper attachment plugin. Alongside the indexed document, I store some basic info, such as whether it's tied to an applicant or an employee, their name, and the ID they're assigned in the system. A typical query looks something like this when it hits ES:

{
  "size" : 100,
  "query" : {
    "query_string" : {
      "query" : "software AND (developer OR engineer)",
      "default_field" : "fileData"
    }
  },
  "_source" : {
    "includes" : [ "applicant.*", "employee.*" ]
  }
}

And gets me results like:

"hits": [100]
    0:  {
      "_index": "careers"
      "_type": "resume"
      "_id": "AVEW8FJcqKzY6y-HB4tr"
      "_score": 0.4530588
      "_source": {
        "applicant": {
          "name": "John Doe"
          "id": 338338
        }
      }
    }...

What I'm trying to do is limit the results so that if John Doe with id 338338 has three different resumes in the system that all match the query, I only get back one match, preferably the highest-scoring one (though that's less important, as long as I can find the person). I've tried different options with filters and aggregations, but I haven't found a way to do this.

I could handle this in the app that calls ES after the results come back, but I'd prefer to do it on the ES side. Since I'm limiting the query to, say, 100 results, I'd like to get back 100 individual people, rather than getting back 100 results and then finding out that 25% of them are docs tied to the same person.
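
For reference, the app-side fallback mentioned above could be a small pass over the returned hits. This is only a sketch: it assumes every hit carries `_score` and a `_source` holding either an `applicant` or an `employee` object with an `id`, as in the sample result, and `person_key` and `dedupe_hits` are names made up for illustration.

```python
def person_key(hit):
    """Return (kind, id) for a hit, or None if it is tied to nobody."""
    source = hit.get("_source", {})
    for kind in ("applicant", "employee"):
        person = source.get(kind)
        if person and "id" in person:
            return (kind, person["id"])
    return None


def dedupe_hits(hits):
    """Keep only the highest-scoring hit per person, best score first."""
    best = {}
    for hit in hits:
        key = person_key(hit)
        if key is None:
            continue  # assumption: hits without a person are dropped
        if key not in best or hit["_score"] > best[key]["_score"]:
            best[key] = hit
    return sorted(best.values(), key=lambda h: h["_score"], reverse=True)
```

The drawback is the one described above: 100 hits in can mean far fewer than 100 people out.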

ckasek
  • `applicant.id` is unique yes? Your question has a similar intent as this one: http://stackoverflow.com/questions/35490641/elasticsearch-filter-the-maximum-value-document/35492605#35492605 – IanGabes Feb 19 '16 at 21:02

3 Answers

2

What you want is a terms aggregation to get the top 100 unique records, with a sub-aggregation asking for the "top_hits" of each one. Here is an example from my system, in which I'm:

  1. setting the result size to 0 because I only care about the aggregations
  2. setting the size of the aggregation to 100
  3. getting the top 1 result for each bucket

GET index1/type1/_search
{
  "size": 0,
  "aggs": {
    "a1": {
      "terms": {
        "field": "input.user.name",
        "size": 100
      },
      "aggs": {
        "topHits": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
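
Adapted to the fields in the question, the same idea might look like the sketch below. It builds the request body as a Python dict and prints it as JSON. Two things here are my assumptions rather than part of the example above: the aggregation names (`people`, `best_score`, `best_resume`) are arbitrary, and the `max` sub-aggregation on `_score` is added because a terms aggregation orders buckets by document count by default, not by relevance. The `{"source": "_score"}` script form is the newer syntax; older releases spell the script differently.

```python
import json


def top_hit_per_person(query_text, id_field="applicant.id", people=100):
    """Build a search body returning the best-scoring resume per person."""
    return {
        "size": 0,  # only the aggregation results are needed
        "query": {
            "query_string": {"query": query_text, "default_field": "fileData"}
        },
        "aggs": {
            "people": {
                "terms": {
                    "field": id_field,
                    "size": people,
                    # order buckets by their best score, not by doc count
                    "order": {"best_score": "desc"},
                },
                "aggs": {
                    "best_score": {"max": {"script": {"source": "_score"}}},
                    "best_resume": {
                        "top_hits": {
                            "size": 1,
                            "_source": {
                                "includes": ["applicant.*", "employee.*"]
                            },
                        }
                    },
                },
            }
        },
    }


print(json.dumps(top_hit_per_person("software AND (developer OR engineer)"), indent=2))
```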

jhilden
2

There's a simpler way to do what @ckasek wants: Elasticsearch's field collapsing, available since version 5.3.

Field Collapsing, as described in the Elasticsearch docs:

Allows to collapse search results based on field values. The collapsing is done by selecting only the top sorted document per collapse key.

Starting from the query in the question, add a `collapse` clause on the field that identifies the person:

{
  "size" : 100,
  "query" : {
    "query_string" : {
      "query" : "software AND (developer OR engineer)",
      "default_field" : "fileData"
    }
  },
  "collapse": {
    "field": "applicant.id"
  },
  "_source" : {
    "includes" : [ "applicant.*", "employee.*" ]
  }
}
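
One wrinkle: `collapse` takes a single field, while the question stores the ID under either `applicant.id` or `employee.id`. A possible workaround, sketched here as an assumption and not something I've run against that index, is to send one request per person type and add an `exists` filter so that documents lacking the field stay out of that request. The helper name `collapsed_query` is made up, and the sketch assumes both ID fields are mapped as numeric or keyword, which collapsing requires.

```python
def collapsed_query(query_text, id_field, size=100):
    """Build a search body that returns at most one hit per id_field value."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": {
                    "query_string": {
                        "query": query_text,
                        "default_field": "fileData",
                    }
                },
                # keep only documents that actually have the collapse field
                "filter": {"exists": {"field": id_field}},
            }
        },
        "collapse": {"field": id_field},
        "_source": {"includes": ["applicant.*", "employee.*"]},
    }


text = "software AND (developer OR engineer)"
applicant_body = collapsed_query(text, "applicant.id")
employee_body = collapsed_query(text, "employee.id")
```

With the default relevance sort, each collapsed hit is the highest-scoring resume for that person.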
LaCroixed
0

Using the answer above and the link from IanGabes, I was able to restructure my search like so:

{
    "size": 0,
    "query": {
        "query_string": {
            "query": "software AND (developer OR engineer)",
            "default_field": "fileData"
        }
    },
    "aggregations": {
        "employee": {
            "terms": {
                "field": "employee.id",
                "size": 100
            },
            "aggregations": {
                "score": {
                    "max": {
                        "script": "scores"
                    }
                }
            }
        },
        "applicant": {
            "terms": {
                "field": "applicant.id",
                "size": 100
            },
            "aggregations": {
                "score": {
                    "max": {
                        "script": "scores"
                    }
                }
            }
        }
    }
}

This returns two aggregations: one with a bucket per applicant ID and the highest score among that applicant's matching docs, and the same for employees. The `scores` script is nothing more than a Groovy script file stored on the node whose entire content is `_score`.
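
Reading the result back in the calling app could look roughly like this. It is a sketch that assumes the usual terms-aggregation response shape (`buckets` with `key`, `doc_count`, and the `score` sub-aggregation's `value`); the sample response is invented for illustration and `best_scores` is a hypothetical helper name.

```python
def best_scores(response, agg_name):
    """Return (person_id, max_score) pairs for one aggregation, best first."""
    buckets = response["aggregations"][agg_name]["buckets"]
    pairs = [(b["key"], b["score"]["value"]) for b in buckets]
    # terms buckets arrive ordered by doc_count, so re-sort by score here
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Invented sample, shaped like a response to the query above.
sample = {
    "aggregations": {
        "applicant": {
            "buckets": [
                {"key": 338338, "doc_count": 3, "score": {"value": 0.45}},
                {"key": 112233, "doc_count": 1, "score": {"value": 0.71}},
            ]
        },
        "employee": {"buckets": []},
    }
}

applicants = best_scores(sample, "applicant")
employees = best_scores(sample, "employee")
```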

ckasek